Clusterpoint DBMS
Developer's Guide

version 2.0

© Clusterpoint Ltd. 2006-2011. All rights reserved

Important Notice

This documentation is provided as part of the Clusterpoint Server and Clusterpoint DBMS systems.

 

The content of this document is provided for reference use only, is subject to change without notice, may contain technical inaccuracies or typographical errors, and should not be construed as a commitment of Clusterpoint Ltd.

 

Clusterpoint Ltd. may make any improvements or changes in the product described in this document at any time without notice.

 

Copyrights

2006 – 2011 Clusterpoint Ltd.  All rights reserved.

Use of this documentation is subject to the following terms:

The content of this documentation may not be altered or edited in any way. Only conversion to other formats is allowed.

You may create a printed copy for your own personal use.

For all other uses, such as selling printed copies or using (parts of) the documentation in another publication, prior written agreement from Clusterpoint Ltd. is required.

All brand names and product names used in this document are trade names, service marks, trademarks or registered trademarks of their respective owners.

 

Contact Information

Email: support <-> @ <-> clusterpoint.com (please ignore markup added for spam filtering)

Website: www.clusterpoint.com

 

 

Table of Contents


Important Notice

Copyrights

Contact Information

Table of Contents

Preface

Audience

Structure of This Guide

Related Information

Typographic Conventions

Abbreviations

1. Introducing Clusterpoint Server

1.1. What is Clusterpoint Server (CPSE) and Clusterpoint DBMS?

1.2. Clusterpoint Server in Corporate Networks

1.3. Understanding Clusterpoint Server Environment

1.3.1. Overview

1.3.2. Accessing Clusterpoint Server

1.4. Clusterpoint Server Concepts

1.4.1. Clusterpoint Server

1.4.2. Clusterpoint Server FTS Capability

1.4.3. Clusterpoint API

1.4.4. Clusterpoint Server Web Server Module

1.4.5. Clusterpoint Server Demons

1.4.6. Clusterpoint Server Document

1.4.7. Clusterpoint storage

1.4.8. Clusterpoint Server Vocabulary

1.4.9. Clusterpoint Server Document Repository

1.4.10. Clusterpoint Index

1.5. Clusterpoint Server Architecture

1.5.1. Client – Server Architecture

1.5.2. Multiple Storages Architecture

1.5.3. Multi-Server Architecture

1.5.4. Understanding Storing Information in Clusterpoint Server

1.5.5. Storing and Indexing Documents in Clusterpoint storage

1.5.6. Querying Clusterpoint storage

1.6. Standards Compatibility

1.7. Features

2. Understanding Clusterpoint Server Document Structure

2.1. Overview

2.2. Creating Document Structure with Application

2.3. Importing XML data with custom structure

2.4. Document Ordering According to Clusterpoint Information Ranking in Result Set

2.4.1. Overview

2.4.2. Rate

2.4.3. Relevance

2.4.3.1. Relevance Calculation Algorithm

2.4.3.2. Customizing Specific Weight Interval, incl. Textual ranking

2.4.3.3. Results Grouping and Ordering according to Clusterpoint Information Ranking

3. MULTI-LANGUAGE SUPPORT

3.1. Multi-language Support and Character Encoding

3.1.1. Overview

3.1.2. Storing and Searching in Single Encoding

3.1.3. Storing in Different Encodings and Searching in Multiple Bytes per Character Encoding

3.1.4. Storing in Different Encodings and Searching in One Byte per Character Encoding

3.2. Formatting XML Special Characters

4. Clusterpoint API Specification

4.1. Overview

4.1.1. Submitting Clusterpoint Server Commands and Receiving Replies

4.1.1.1. Exchanging XML Messages Directly

4.1.1.2. Submitting Parameters and Receiving Formatted Replies

4.1.2. XML Message Structure

4.2. Clusterpoint XML Message Envelope

4.2.1.1. XML Request

4.2.1.2. XML Reply

4.3. DATA MANIPULATION

4.3.1. API commands - INSERT, UPDATE, REPLACE

4.3.1.1. XML Request

4.3.1.2. HTTP GET Parameters

4.3.1.3. XML Reply

4.3.1.4. Binary Files Conversion

4.3.2. API command - DELETE

4.3.2.1. XML Request

4.3.2.2. HTTP GET Parameters

4.3.2.3. XML Reply

4.3.3. API command - INDEX

4.3.3.1. XML Request

4.3.3.2. HTTP GET Parameters

4.3.3.3. XML Reply

4.3.4. API command - CLEAR

4.3.4.1. XML Request

4.3.4.2. HTTP GET Parameters

4.3.4.3. XML Reply

4.4. STATUS MONITORING

4.4.1. API command - STATUS

4.4.1.1. XML Request

4.4.1.2. HTTP GET Parameters

4.4.1.3. XML Reply

4.5. DATA RETRIEVAL

4.5.1. API commands - LOOKUP and RETRIEVE

4.5.1.1. XML Request

4.5.1.2. HTTP GET Parameters

4.5.1.3. XML Reply

4.5.2. API command - SEARCH

4.5.2.1. XML Request

4.5.2.1.1. Search Query Syntax

4.5.2.1.1.1. Single Search Term

4.5.2.1.1.2. AND

4.5.2.1.1.3. Phrase Search

4.5.2.1.1.4. OR

4.5.2.1.1.5. NOT

4.5.2.1.1.6. Boolean Expressions

4.5.2.1.1.7. Wildcard Patterns

4.5.2.1.1.8. Ignored Words

4.5.2.1.1.9. Stemming

4.5.2.1.1.10. Search within Markup

4.5.2.1.1.11. Proximity Search

4.5.2.1.1.12. Numeric Search

4.5.2.1.1.13. Numeric Search in More Than One Tag

4.5.2.1.1.14. Case Sensitivity for Proper Names

4.5.2.1.1.15. Grouping Results

4.5.2.1.1.16. Filtering Results by Rate

4.5.2.1.1.17. Web Friendly Result Navigation

4.5.2.1.1.18. Faceted Search (XML drill-down)

4.5.2.2. HTTP GET Parameters

4.5.2.3. XML Reply

4.5.3. API command - SELECT

4.5.3.1. XML Request

4.5.3.2. XML Reply

4.5.4. API command - SIMILAR

4.5.4.1. XML Request

4.5.4.2. HTTP GET Parameters

4.5.4.3. XML Reply

4.5.5. API command - ALTERNATIVES

4.5.5.1. XML Request

4.5.5.2. HTTP GET Parameters

4.5.5.3. XML Reply

4.5.6. API command - LIST-LAST

4.5.6.1. XML Request

4.5.6.2. HTTP GET Parameters

4.5.6.3. XML Reply

4.6. CONTEXT TRIGGERS FOR ALERTING APPLICATIONS

4.6.1. API command - ADD_TRIGGER

4.6.2. API command - REMOVE_TRIGGER

4.6.3. API command - CLEAR_TRIGGERS

4.6.4. API command - EXAMINE

4.6.5. Filter script configuration parameters

4.7. ERROR HANDLING

5. Clusterpoint Server Clustering

5.1. Principles

5.2. Creating Clusterpoint Server Cluster

6. Use Cases

6.1. Use Case in C: Importing Text Files

6.2. Use Case in Perl: Importing Text Files

6.3. Use Case in PHP: Searching Clusterpoint storage and Returning Results in HTML

6.4. Use Case in ASP: Searching Clusterpoint storage and Returning Results in HTML

6.5. Use Case in Java: Searching Clusterpoint storage from applet

6.5.1. Clusterpoint ServerJApi.java

6.5.2. Clusterpoint ServerMess.java

6.5.3. Clusterpoint ServerExch.java

6.5.4. Clusterpoint ServerXMLParser.java

Appendix A: Error Messages

Error Handling

Reporting Problems

Appendix B: Frequently Asked Questions

Appendix C: STORAGE CONFIGURATION FILE

Glossary




Preface

This preface is an introduction to the Clusterpoint Server (CPSE) Developer’s Guide. It defines the audience, describes the structure of this guide, and lists typographic conventions and abbreviations used throughout the guide.

This guide applies to Clusterpoint Server version 2.0.

Clusterpoint Server (CPSE) is part of the Clusterpoint Database Management System (DBMS). It is database server engine software, written in C/C++, that supports the Clusterpoint API. It works as transparent cluster database software (the same copy runs on every computer in the cluster).

Clusterpoint Server operates as database server software on any commodity 32-bit or 64-bit computer hardware.

Clusterpoint API (application programming interface) is an XML-based interface protocol between applications and the Clusterpoint Server software.

Clusterpoint Manager is an administration and configuration application, developed in PHP as a Web server module, that communicates with all Clusterpoint Server software instances installed across the cluster.

This section contains the following topics:

·         Audience

·         Structure of This Guide

·         Related Information

·         Typographic Conventions

·         Abbreviations

 

Audience

This guide is intended for application developers using Clusterpoint Server as a corporate search technology platform for building and operating customized applications based on Clusterpoint API.

Structure of This Guide

This guide has the following structure:

Section | Description
Introducing Clusterpoint Server | Describes Clusterpoint Server, its concepts, and architecture.
Understanding Clusterpoint Server Document Structure | Describes Clusterpoint Server document structure and presents scenarios for unstructured source data and XML structured source data.
Multi-Language Support | Describes multi-language support and text encoding concepts, and explains XML formatting concepts.
Clusterpoint API Specification | Contains all Clusterpoint Server function descriptions and syntax in XML.
Clusterpoint Server Clustering | Describes Clusterpoint Server clustering.
Use Cases | Lists and describes sample applications that are based on Clusterpoint API commands.
Appendix A: Error Messages | Contains a list of error messages.
Appendix B: Frequently Asked Questions | Contains a list of frequently asked questions and their answers.

Related Information

The Clusterpoint DBMS documentation includes the following guides:

Title | Description
Clusterpoint DBMS Developer's Guide | Describes how to develop custom database storage, search and indexing applications for the Clusterpoint Server core software included in the Clusterpoint Server package.

 

 

Typographic Conventions

The following styles and conventions are used in this guide:

Convention | Description
Verdana | Represents command, function, file and directory names, system messages, and command-line commands.
Hyperlink | Represents a hyperlink. Clicking on this field takes you to the identified place.
Example | Represents an example.
Source code | Represents code.
Comment | Represents a comment in the code.

Abbreviations

The following abbreviations are used in this guide.

Abbreviation | Description
Clusterpoint Server | Clusterpoint Server (hardware + crawler and search engine software).
Clusterpoint Server | Clusterpoint Server - core Clusterpoint DBMS server software. It can be installed on any hardware that is networked into a private cloud architecture to form a cluster; when clustering functionality is used, the Clusterpoint Server software uses the combined storage, RAM and CPU resources of that hardware.
MANAGER | Clusterpoint Manager - Web tool for centralized administration, configuration and monitoring of all Clusterpoint Server systems: server instances in RAM servicing storages (databases), database storages, cluster storages, and underlying cluster equipment resources.
API | Application programming interface.
FTS | Full text search.
XML | Extensible markup language.
HTML | Hypertext markup language.
SQL | Structured query language.
UTF | UCS (universal character set) transformation format.
HTTP | Hypertext transfer protocol.

1. Introducing Clusterpoint Server

This guide introduces Clusterpoint Server from an application developer’s perspective and provides reference material for building customized applications based on Clusterpoint Server.

This section includes the following:

·         What is Clusterpoint Server?

·         Clusterpoint Server in Corporate Networks

·         Understanding Clusterpoint Server Environment

·         Concepts

·         Clusterpoint Server Architecture

·         Standards Compatibility

·         Features

1.1. What is Clusterpoint Server (CPSE) and Clusterpoint DBMS?

Clusterpoint Server is the core database storage and search engine of the Clusterpoint database management system (DBMS). It provides the database information storage, access, search and retrieval used across the Clusterpoint DBMS product line. Clusterpoint Manager, an application for administering Clusterpoint databases, is also included in Clusterpoint DBMS.

Other Clusterpoint products include Clusterpoint Crawler, Clusterpoint Searcher, Clusterpoint Manager, Clusterpoint Network Traffic Surveillance System, and other vertical application sector technology solutions and products provided by Clusterpoint Ltd. These products may include all or part of Clusterpoint DBMS, or be integrated for cohesive use with Clusterpoint Server.

The Clusterpoint Server system consists of the Clusterpoint Server and application programming interface (Clusterpoint API) for building information storage and retrieval applications.

The Clusterpoint Server is an operational platform that performs information storage, access, search and retrieval tasks by executing a predefined set of commands.

Clusterpoint API is used for developing applications that are customized to your company's needs.

Note: This guide provides sample code for building Clusterpoint Server applications for the most common source data formats and retrieval scenarios. It is not possible to cover every scenario, so the guide offers only basic reference material for building your own applications.

The volume of data held by companies is growing rapidly. Much of it is textual information, especially in web applications, and it is either unstructured (texts, emails, documents) or semi-structured (text with some structural markup).

Advances in hardware technology have also made databases increasingly document-oriented: data is no longer split among multiple tables and columns. Keeping all the relevant data in one place makes huge databases simpler to manage. Clusterpoint Server is also a document-oriented database platform, and it works on collections of XML documents that we simply call “storages”.
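To make the document-oriented idea concrete, the following Python sketch builds one self-contained XML document instead of spreading the same data over several tables. The element names (document, id, title, body, tags, tag) are illustrative assumptions only and are not the document structure required by Clusterpoint Server, which is described later in this guide.

```python
import xml.etree.ElementTree as ET


def build_document(doc_id, title, body, tags):
    """Build one self-contained XML document (element names are hypothetical)."""
    doc = ET.Element("document")
    ET.SubElement(doc, "id").text = doc_id
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    # Repeating values live inside the same document rather than in a
    # separate table joined by a key.
    tags_element = ET.SubElement(doc, "tags")
    for tag in tags:
        ET.SubElement(tags_element, "tag").text = tag
    return doc


if __name__ == "__main__":
    sample = build_document("1", "Sample title", "Sample body text", ["a", "b"])
    print(ET.tostring(sample, encoding="unicode"))
```

Everything an application needs about the item travels in a single document, which is what allows a storage to be treated as a plain collection of XML documents.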

One of the key ways to retrieve such data from document collections effectively, and therefore to make the data usable, is full text search (FTS). Full text search is probably one of the key competitive advantages of Clusterpoint database technology. It relies on a very powerful methodology implemented in Clusterpoint Server for information indexing and searching: Clusterpoint Information Ranking.

Full text search in Clusterpoint Server is based on an optimized mathematical model and our own original data ranking algorithms, which ensure very high performance when searching structured, unstructured or semi-structured information in large amounts of data, compared to traditional legacy SQL systems. For this purpose, all data in Clusterpoint Server are stored in a special type of index, the Clusterpoint Index, which is a cohesive combination of Clusterpoint Index, a graph database index and relational database indexing models.

For more information on how data are stored in Clusterpoint Server, see Understanding Storing Information in Clusterpoint Server.
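For readers new to full text search, the general idea of answering queries from an index rather than by scanning every document can be sketched as follows. This is a generic textbook inverted index written for illustration; it is not the Clusterpoint Index and does not implement Clusterpoint Information Ranking, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return IDs of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {"d1": "Full text search in XML storages", "d2": "XML document storage"}
    print(search_all_terms(build_index(docs), "xml search"))
```

The lookup cost depends on the number of matching terms and documents, not on the total size of the collection, which is why index-based search scales to large amounts of data.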

Subjects for full text search can be any unstructured and XML structured data, for example, text collections, separate phrases or words in text documents, Web pages, Web addresses, several special markups for textual and numerical data, bookmarks of HTML or XML pages, domain names, SQL database entry key IDs, file names, XML field or tag names, and so on.

The following figure illustrates the Clusterpoint Server system from a high level:

Figure 1: Clusterpoint Server operational diagram

In Figure 1, all data storage, manipulation and search are implemented by the Clusterpoint Server software running on the hardware; for example, it performs database search queries, update requests, status and control commands, and so on. This functionality is built into the Clusterpoint Server core server software and can be accessed through Clusterpoint API.

1.2. Clusterpoint Server in Corporate Networks

The Clusterpoint Server system can be integrated into an existing corporate network. Clusterpoint Server is incorporated into the network just like any other server computer or appliance. The following figure shows a sample corporate network that includes hardware with the Clusterpoint Server software installed:

Figure 2: Clusterpoint Server in a corporate network

Application servers and transaction processors in an existing corporate network can access the Clusterpoint DBMS core software, Clusterpoint Server, as active Clusterpoint API clients; in that case, for security reasons, end users cannot access the Clusterpoint Server directly. In this way, Clusterpoint Server can be used in any corporate network, independently of the operating system, database environment, or programming language used for application development. Normally, security partitioning is done at the application server level, and Clusterpoint Server provides only database storage and search functionality.

For large amounts of data, the Clusterpoint Server core software supports generic server clustering, which delivers performance scalability (for indexing), search scalability (workload sharing among multiple identical copies of a database running on different hardware) and data volume scalability (the database is split into N parts, or shards, each containing 1/N of the total database content). You can also combine these clustering options.
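To make the idea of splitting a database into N shards concrete, the following Python sketch assigns each document to one shard. This guide does not state how Clusterpoint Server distributes documents among shards; the hash-based rule and the function name shard_for are assumptions made purely for illustration.

```python
import zlib


def shard_for(document_id: str, shard_count: int) -> int:
    """Map a document ID to one of shard_count shards (0-based).

    Illustrative rule only: not the distribution method used by Clusterpoint Server.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(document_id.encode("utf-8")) % shard_count


# Distribute 9000 made-up document IDs over N = 3 shards and count them.
ids = ["doc-%d" % i for i in range(9000)]
counts = [0, 0, 0]
for doc_id in ids:
    counts[shard_for(doc_id, 3)] += 1
print(counts)
```

Because the mapping depends only on the document ID, a given ID always lands on the same shard, which fits the requirement that a document ID be unique across the whole cluster.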

For more information on Clusterpoint Server multi-server architecture, see Multi-Server Architecture.

1.3.                    Understanding Clusterpoint Server Environment

This section briefly describes the Clusterpoint Server software environment from a developer's perspective.

This section contains the following topics:

·         Overview

·         Accessing Clusterpoint Server

 

1.3.1.  Overview

The Clusterpoint Server is the core software in the Clusterpoint DBMS system that performs data storage, search and retrieval. It understands and executes a predefined set of commands (the Clusterpoint API).

The commands are implemented as Clusterpoint API XML messaging requests and transported from/to Clusterpoint Server via HTTP POST or GET messages.

Note: For security reasons applications can access Clusterpoint Server through HTTPS protocol, if necessary.  Installation of digital certificates on the Web server needed for HTTPS support is not covered by this manual.  Everything related to HTTP messaging works also for HTTPS messaging.

Before API commands can be sent to the Clusterpoint Server, they must be created. From the application developer's point of view, there are two basic methods of creating and executing Clusterpoint API commands.

 

Method 1 - constructing Clusterpoint API XML request messages directly and sending them to Clusterpoint Server with the HTTP POST method. This can be done in any programming language that supports string operations and the HTTP protocol.

 

Method 2 - using the HTTP GET method to send Clusterpoint API commands as CGI parameters in the request URL, where those CGI parameters follow the Clusterpoint API syntax and the supported Clusterpoint API command set.

 

In both cases your application has to communicate over the HTTP protocol with the Clusterpoint API gateway module called 'cpse-gw.cgi'. This module automatically processes GET or POST data, depending on which method you used. It is installed on every machine on which the Clusterpoint Server software is installed (on each cluster node).
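The two methods can be sketched in Python as follows. Only the module name 'cpse-gw.cgi' comes from this guide; the host, the /cgi-bin/ path, the content type, the placeholder XML message and the parameter names command and query are assumptions for illustration and are not actual Clusterpoint API syntax. The sketch only builds the requests and does not send them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical gateway location; adjust to your own installation.
GATEWAY_URL = "http://localhost/cgi-bin/cpse-gw.cgi"


def build_post_request(xml_message: str) -> Request:
    """Method 1: put a ready-made XML request message in the HTTP POST body."""
    return Request(
        GATEWAY_URL,
        data=xml_message.encode("utf-8"),
        # Content type is an assumption; the guide does not specify one.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def build_get_request(params: dict) -> Request:
    """Method 2: pass the command as CGI parameters in the URL."""
    return Request(GATEWAY_URL + "?" + urlencode(params), method="GET")


# Placeholder message and parameters: NOT actual Clusterpoint API syntax.
post_req = build_post_request("<request>placeholder</request>")
get_req = build_get_request({"command": "search", "query": "full text"})

print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

Either request object could then be passed to urllib.request.urlopen to contact a running gateway.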

 

1.3.2.  Accessing Clusterpoint Server

The following diagram describes how the Clusterpoint Server (CPSE) is accessed:

Figure 3: Accessing Clusterpoint Server

The following steps provide a general description of how the Clusterpoint Server is accessed:

1.      Users enter commands, such as search queries, for the Clusterpoint Server from a user interface of an application, for example, a Web search form.

2.      A custom built application calls a Clusterpoint API command with its parameters.

3.      The Web server receives the HTTP request and passes it to the Clusterpoint Server Web server module (cpse-gw.cgi) using the Common Gateway Interface (CGI) of the Web server.

4.      The Clusterpoint Server Web server module translates each user command into a Clusterpoint XML request and submits it to the Clusterpoint Server via UNIX domain sockets (internally).

5.      The Clusterpoint Server responds to the application with Clusterpoint XML replies, which can optionally be formatted using an XSLT stylesheet or your preferred programming language (Java, PHP, Ruby on Rails, JavaScript etc.) and then displayed through the application user interface, for example, a Web page.

 

If you have large, complex documents or custom databases that you want to turn into a searchable database, the POST method is usually used to send this information to Clusterpoint Server. You can create Clusterpoint XML messages on the application side, e.g., when indexing custom data into Clusterpoint Server. Using the POST-based Method 1 you can index very large documents. The GET method has a length limit that depends on the web server you use; in some rare cases it can be as low as 2 KB.

Note: The HTTP GET command restricts the total length of all CGI parameters present in the URL. For XML data messages of 4 Kbytes or larger, you should always use the HTTP POST method to communicate with Clusterpoint Server.
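An application can apply this size rule automatically. The Python sketch below picks the HTTP method from the length of the URL-encoded parameters; the 4 Kbyte threshold is taken from the note above, while the function name and the parameter names are hypothetical.

```python
from urllib.parse import urlencode

# Threshold from the note above; treat it as a rule of thumb, since
# some web servers impose lower limits on URL length.
MAX_GET_BYTES = 4 * 1024


def choose_http_method(params: dict) -> str:
    """Return "GET" for short requests and "POST" for 4 Kbytes or more."""
    encoded = urlencode(params)  # percent-encoded, so always plain ASCII
    return "GET" if len(encoded) < MAX_GET_BYTES else "POST"


print(choose_http_method({"query": "clusterpoint"}))
print(choose_http_method({"data": "a" * 5000}))
```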

Method 2 is better suited to search queries and document retrieval, which can be issued as simple HTTP GET commands. A server-side mechanism reads the HTTP GET parameters and composes Clusterpoint XML request messages from them. You do not have to create XML request messages in your application software; you only have to pass the right parameters in URLs, from which Clusterpoint XML request messages are automatically created and sent internally to the Clusterpoint Server for processing.

In effect, an HTTP GET command invokes a server-side pre-processor that creates the same Clusterpoint XML messages, understood by the Clusterpoint Server core software, as if you had used the HTTP POST method.

Using the HTTP POST method also gives you slightly better performance if you have already built a database of Clusterpoint XML formatted documents and just want to send them for indexing. If you use a language such as Java, PHP or Perl for application development, the HTTP GET command can be slightly faster for querying or indexing short data items, because with GET the Clusterpoint XML message is constructed by a server-side module written in C. However, the performance differences are usually not significant, and our recommendation is to choose the method that best suits your application's needs.

In a similar way, there is also a standard mechanism that formats XML reply messages received from the Clusterpoint Server, for example, by using an XSLT stylesheet. Again, it is your decision whether to handle XML reply messages on the application side or to create an XSLT stylesheet that automatically formats the results received from the Clusterpoint Server so that they can be passed directly to end users.
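Handling a reply on the application side can be as simple as parsing the XML and picking out the fields you need. The Python sketch below does this for a made-up reply; the element names reply, document, id and title are assumptions, because the actual reply format is defined by the Clusterpoint API and is not shown in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply used only for illustration.
SAMPLE_REPLY = """<reply>
  <document><id>doc-1</id><title>First result</title></document>
  <document><id>doc-2</id><title>Second result</title></document>
</reply>"""


def extract_results(reply_xml: str) -> list:
    """Return (id, title) pairs for every <document> element in the reply."""
    root = ET.fromstring(reply_xml)
    results = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id", default="")
        title = doc.findtext("title", default="")
        results.append((doc_id, title))
    return results


for doc_id, title in extract_results(SAMPLE_REPLY):
    print(doc_id, title)
```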

 

1.4.                    Clusterpoint Server Concepts

This section introduces and briefly explains Clusterpoint DBMS software platform concepts that readers must be familiar with before going into details.

This section contains the following topics:

·         Clusterpoint Server

·         Clusterpoint Server FTS Capability

·         Clusterpoint API

·         Clusterpoint Server Web Server Module

·         Clusterpoint Server Demons

·         Clusterpoint Server Document

·         Clusterpoint storage

·         Clusterpoint Server Vocabulary

·         Clusterpoint Server Document Repository

·         Clusterpoint Index

 

1.4.1.  Clusterpoint Server

Clusterpoint Server is a stand-alone server for storing and retrieving information such as plain texts or XML structured documents. It can be run in one or more instances per computer.

For more information, see Multiple Storages Architecture.

1.4.2.  Clusterpoint Server FTS Capability

Clusterpoint Server is designed to support retrieving stored information using full text search queries with rich enterprise search functionality.

1.4.3.  Clusterpoint API

Clusterpoint Server application programming interface (API) is a standardized set of commands for accessing the Clusterpoint Server.

1.4.4.  Clusterpoint Server Web Server Module

Clusterpoint Server Web server module is a module integrated with the built-in Web server that receives requests from an application through the Web server via HTTP POST or GET and dispatches them to the Clusterpoint Server via UNIX domain sockets.

The Clusterpoint Server Web server module also includes the functionality for composing XML request messages from HTTP GET or POST parameters and for optionally formatting the XML reply messages with a given XSLT stylesheet.

The Clusterpoint Server system is designed so that the Clusterpoint Server module can be integrated with the Web server through the Common Gateway Interface (CGI).  Optional Clusterpoint Server gateway module interfaces for Apache API or FastCGI are available, which can further increase the performance of the whole system.

1.4.5.  Clusterpoint Server Demons

Clusterpoint Server demons are internal UNIX processes implemented as part of Clusterpoint Server platform core functionality and used to store and retrieve documents.

In Clusterpoint Server Version 1.0 there were the following demons:

·         document or data handling demon 'cpse-dat'

·         a demon to create and search Clusterpoint Index (index demon 'cpse-idx')

·         a demon to build and use vocabulary of unique words (vocabulary demon 'cpse-voc')

·         a demon to manage overall system communications between all server demons on a single server and within cluster topology (manager demon 'cpse-mgr').

In Clusterpoint Server Version 2.0, to improve performance and cut down interprocess communication, we have reduced the number of demons per storage to only two:

·         a document storage, indexing and search demon, which also handles the vocabulary for that particular storage (storage demon 'cps2-storage')

·         a master demon to manage overall system communications between all demons on a single server and within cluster topology (manager demon 'cps2-master').

All demons can be run as multiprocessor and multithreading processes capable of effective utilisation of available hardware resources.

1.4.6.  Clusterpoint Server Document

A Clusterpoint Server document is the basic unit in the Clusterpoint storage (database) against which searching is performed. It can be unstructured (pure text) or XML structured; the latter can also combine both models, making a document semi-structured (for example, the full text of a newspaper article with meta structure such as author, date, source etc.).

1.4.7.  Clusterpoint storage

Clusterpoint storage is a named database (in other words, a collection of XML documents as the basic database data objects) for storing Clusterpoint Server documents in a format that ensures that each document is uniquely identified by a Clusterpoint document ID and that search can be performed very fast across all documents stored in that named storage.

Each Clusterpoint storage is serviced by one Clusterpoint Server software instance in RAM, and on the disk consists of Vocabulary, Document repository, and Clusterpoint Index.

Multiple different storages (databases) can be run on a single hardware computer in parallel, serviced by separate Clusterpoint Server instances, with their own users and access rights. 

This virtualization architecture of Clusterpoint database platform can be efficiently used for safe partitioning of databases servicing different applications, all running on the same hardware equipment and efficiently using shared CPU, RAM and disk resources. 

You do not need hypervisors or other virtualisation software to run parallel databases on the same hardware with Clusterpoint software: the Clusterpoint Server architecture provides this functionality out of the box.

1.4.8.  Clusterpoint Server Vocabulary

Clusterpoint Server Vocabulary is a list of all unique “atomic” elements in a particular Clusterpoint storage: text items, strings, numbers, email addresses, dates, XML tags, URL or URI links, etc. We sometimes call those basic elements “words” for simplicity. Please note that the term “word” is not linguistic in the Clusterpoint architecture. It is a technical term describing any unique string found in your custom XML data objects; these strings are stored in the Clusterpoint Vocabulary much as the words of a spoken language are organized into a real vocabulary.

Clusterpoint Vocabulary is created by Clusterpoint Server during indexing: it splits all XML documents into “atomic” elements separated by delimiter characters, which can be custom configured for each named storage.  Those “atomic” elements we call “words”.  They are added to the vocabulary while storing these documents to the Clusterpoint storage.  Each Clusterpoint storage builds its own Vocabulary when new documents are being added or updated. Each word in the vocabulary has an ID of the integer type assigned to it.

Vocabulary of all unique words actually indexed per storage is stored in RAM for better performance.

Each storage has its own Vocabulary and can be configured and fine-tuned for indexing separately through the Storage Configuration file.
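The way a vocabulary of “words” with integer IDs is built can be sketched in Python as follows. This is a toy model, not the Clusterpoint implementation: the delimiter set, the class name Vocabulary and the sequential numbering of IDs are assumptions, and in a real storage the delimiters are configured per storage.

```python
import re

# Illustrative delimiter set; real delimiters are configured for each storage.
DELIMITERS = " \t\n,;!?()<>"


def split_into_words(text, delimiters=DELIMITERS):
    """Split text into "atomic" elements separated by delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [word for word in re.split(pattern, text) if word]


class Vocabulary:
    """Maps each unique word to an integer ID (toy model)."""

    def __init__(self):
        self.word_ids = {}

    def add(self, word):
        # A word seen before keeps its existing ID; a new word gets the next one.
        if word not in self.word_ids:
            self.word_ids[word] = len(self.word_ids) + 1
        return self.word_ids[word]


vocabulary = Vocabulary()
words = split_into_words("red apple, green apple; mail john@example.com")
ids = [vocabulary.add(word) for word in words]
print(words)
print(ids)
```

Note that the email address stays a single “word” because neither "@" nor "." is in the delimiter set chosen here.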

1.4.9.  Clusterpoint Server Document Repository

The Clusterpoint Server Document repository is the place where all Clusterpoint Server documents are stored in their original XML format, as they were submitted to the Clusterpoint Server system, so that they can be returned in response to search and data retrieval requests.

Every document must be uniquely identified by a Clusterpoint Document ID, which must be unique within its named storage. Please note that the customer should choose Document IDs so that each is a unique string value within the named storage, even when that storage spans multiple cluster nodes (storages with the same name on different cluster nodes are treated as parts of one named storage, so a Document ID must be unique across the whole cluster).

Each Clusterpoint storage has its own Document repository, which can reside on a single server or be spread over multiple cluster nodes. Cluster storages (storages with the same name on different servers, configured to work as a single logical cluster database in the Clusterpoint architecture) have a Document repository distributed over all cluster nodes that form the cluster storage; the Document repository on a particular cluster node contains only the documents stored on that node.

1.4.10.                     Clusterpoint Index

Clusterpoint Index is a customized pre-sorted index of all XML document basic “atomic” elements: words, strings, email addresses, numbers, urls etc., where each basic element has a list of pointers to Clusterpoint Server documents in which the element occurs, relations with other document parts, and for numeric and date sorting also traditional range indexes. Clusterpoint Index is pre-sorted index ranked for fast and relevant search using Clusterpoint customizable Information Ranking mechanism.

Clusterpoint Index ensures fast structured search and full text search (FTS) functionality with possibility to build different logical expressions when performing a database search.  

Each Clusterpoint storage has its own unique Clusterpoint Index, which is always organized at the data storage level according to the customer's own information ranking rules and algorithms for the particular database.

Clusterpoint DBMS provides a programmable mechanism for custom ranking of your database content when the Clusterpoint Index is created. Information ranking provides sub-second, Internet-style ad hoc search for FTS queries in a cluster and delivers the most relevant search results grouped and sorted upfront, for example, on the first results page. Even when user queries consist of a few simple search keywords, the results are sorted for relevance by the customer's own information ranking rules using the pre-sorted Clusterpoint Index. Users performing an ad hoc database search therefore get a nearly instant (sub-second) response with relevant search results, grouped and ordered for the best search experience from the customer's point of view. This feature of the Clusterpoint Index also enables linear search scalability in large cluster databases, where ranked ad hoc search can be performed without the performance loss characteristic of legacy SQL databases.

Please see section 2, which describes Clusterpoint Information Ranking in detail.

1.5.                    Clusterpoint Server Architecture

This section describes Clusterpoint Server from various architectural perspectives.

This section contains the following topics:

·         Client — Server Architecture

·         Multiple Storages Architecture

·         Multi-Server Architecture

·         Understanding Storing Information in Clusterpoint Server

·         Storing and Indexing Documents in Clusterpoint storage

·         Querying Clusterpoint storage

 

1.5.1.  Client – Server Architecture

The following figure describes Clusterpoint Server architecture from the client — server perspective:

Figure 4: Clusterpoint DBMS client — server architecture

From the client — server perspective, the Clusterpoint DBMS system consists of the client part and the server part.

On the client side, developers and administrators work with an application server, customer web application or Clusterpoint Manager utility, which initializes Clusterpoint API command calls and sends them to the Clusterpoint Server via HTTP.

On the server side, the Clusterpoint Server executes these Clusterpoint API commands accessing data in the Clusterpoint storage and sends a reply back to the client side’s application server.

 

1.5.2.  Multiple Storages Architecture

The following figure describes how multiple Clusterpoint Server instances can be run on a single machine with the Clusterpoint software installed (a cluster node):

 

Figure 5: Clusterpoint Server multiple storages architecture

Multiple instances of the Clusterpoint Server can be run on a single computer, each working with its own Clusterpoint storage. This Clusterpoint DBMS system architecture allows customer database applications to run in a virtualized environment, where each server instance securely uses its own RAM and disk space and services only its own storage, with its own users and access security. No virtualization software is necessary: you can run as many different Clusterpoint storages in parallel on the same hardware as your hardware capacity allows. This also makes it possible to utilize all hardware capacity for different database applications. Using the Clusterpoint Manager application web interface, you can securely create, run, view and manage particular storages, or all of them, on any of the hardware servers installed in the cluster and managed centrally.

 

 

1.5.3.  Multi-Server Architecture

To scale to larger amounts of data, the Clusterpoint Server can be clustered so that a single Clusterpoint storage is shared across many computers.

The following figure describes how a single Clusterpoint storage can be distributed on several Clusterpoint Servers:

Figure 6: Clusterpoint Server multi-server architecture

For more information on Clusterpoint Server clustering, see Clusterpoint Server Clustering.
 

1.5.4.  Understanding Storing Information in Clusterpoint Server

The following figure describes how data are imported and stored in Clusterpoint Server:

Figure 7: Storing information in Clusterpoint Server

1.       Data are entered by end users in custom built applications.

2.       Using Clusterpoint API commands data are submitted to the Clusterpoint Server via HTTP.

3.       From the submitted data, the Clusterpoint Server creates a Clusterpoint Index, a Vocabulary, and a Document repository (your original XML data objects), all of which are contained in the Clusterpoint storage.

For more information on the Clusterpoint storage, see Storing and Indexing Documents in Clusterpoint storage and Querying Clusterpoint storage.

1.5.5.  Storing and Indexing Documents in Clusterpoint storage

Note:    This section contains some of Clusterpoint Server system implementation details. Description provided in this section is very general and does not include implementation details for all Clusterpoint Server functionality.

Note:    The knowledge provided in this section is not required for Clusterpoint Server application developers. However, it can be useful for a better understanding of the Clusterpoint Server system.

The following figure describes the general principles of how a document is stored and indexed in the Clusterpoint storage:

 

 

Figure 8: Indexing documents in Clusterpoint storage

1)      When the Clusterpoint DBMS receives an XML request containing a document that must be imported into the Clusterpoint storage, the Clusterpoint DBMS Master demon[1] (cps2-master) authenticates the request and invokes the storage server instance (cps2-storage) that services the target storage.

2)      The Master demon then sends the document to the Storage demon. The Storage demon parses the XML request and splits it into elementary “atomic” parts: words, strings, tags, numbers, URLs, emails etc.

3)      The Storage demon stores the document in the Document repository and assigns a unique ID of the integer type to each document.

4)      The Storage demon also stores all “atomic” data from the document in the Vocabulary, which is held in RAM, then translates all “atoms” in the document to unique IDs of the integer type and stores them in the Clusterpoint Index.

5)      The Clusterpoint Index is constructed from the document ID and the IDs of all the “atomic” elements contained in the Clusterpoint Vocabulary, applying the Clusterpoint Information ranking rules defined by the customer in the Document Policy configuration file (see below).

6)      The resulting Clusterpoint Index links all “atomic” element IDs with the document ID. At the same time, the index data are sorted according to the customized Clusterpoint Information ranking sort order, creating a set of pre-sorted indexes, highly optimized for sequential disk access, which are later used for fast, relevant, cluster-wide search.
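The indexing steps above can be modelled with a few lines of Python. This is a deliberately simplified sketch and not the cps2-storage implementation: the class name ToyStorage is hypothetical, words are split on whitespace only, and a single numeric rate stands in for the customer-defined Information ranking rules.

```python
class ToyStorage:
    """Illustrative stand-in for one storage (toy model)."""

    def __init__(self):
        self.repository = {}  # integer document number -> original document
        self.vocabulary = {}  # word -> integer word ID
        self.index = {}       # word ID -> [(rate, document number)], pre-sorted

    def store(self, text, rate=0):
        # Step 3: keep the original document and give it an integer ID.
        doc_no = len(self.repository) + 1
        self.repository[doc_no] = text
        # Steps 2 and 4: split into words and translate each word to an ID.
        for word in dict.fromkeys(text.split()):
            word_id = self.vocabulary.setdefault(word, len(self.vocabulary) + 1)
            # Steps 5 and 6: link word ID to document and keep the list sorted
            # by rank (highest rate first, then lowest document number).
            postings = self.index.setdefault(word_id, [])
            postings.append((rate, doc_no))
            postings.sort(key=lambda posting: (-posting[0], posting[1]))
        return doc_no


storage = ToyStorage()
storage.store("apple pie", rate=1)
storage.store("apple tart", rate=5)
print(storage.index[storage.vocabulary["apple"]])  # prints [(5, 2), (1, 1)]
```

Because the posting lists are sorted at indexing time, a later search can read them front to back and meet the best-ranked documents first.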

1.5.6.  Querying Clusterpoint storage

Note:    This section contains some of Clusterpoint Server system implementation details. Description provided in this section is very general and does not include implementation details for all Clusterpoint Server functionality.

Note:    The knowledge provided in this section is not required for Clusterpoint Server application developers. However, it can be useful for a better understanding of the Clusterpoint Server system.

The following figure describes the general principles of how a query is processed in the Clusterpoint storage:

Figure 9: Querying Clusterpoint storage

1)      When the Clusterpoint Server receives an XML request containing a query, the Master  demon (cps2-master) receives and authenticates the XML request.

2)      The Master demon then sends the query to the Storage demon (cps2-storage).

3)      The Storage demon looks up the query terms in the Clusterpoint Vocabulary to translate them into “atomic” element IDs, then searches the Clusterpoint Index and returns a list of the document IDs linked to those term IDs, ranked, grouped and sorted according to the Clusterpoint Index ranking rules for search relevancy.

4)      The Storage demon uses the list of document IDs to retrieve from the Document repository a result set containing the documents that match the query, and sends it to the Master demon, which returns it to the calling application.
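The query steps above can be sketched in Python against a tiny hand-built vocabulary, index and repository. All data and names here are made up for illustration, and requiring every query term to match (AND semantics) is an assumption of the sketch, not a statement about Clusterpoint query behaviour.

```python
# Toy data: word -> word ID, word ID -> document numbers (already in ranked
# order), and document number -> stored document.
VOCABULARY = {"apple": 1, "pie": 2, "tart": 3}
INDEX = {1: [2, 1], 2: [1], 3: [2]}
REPOSITORY = {1: "apple pie recipe", 2: "apple tart recipe"}


def search(query):
    """Return stored documents matching all query terms, in ranked order."""
    # Translate query terms to word IDs through the vocabulary.
    word_ids = []
    for term in query.split():
        if term not in VOCABULARY:
            return []  # an unknown term cannot match any document
        word_ids.append(VOCABULARY[term])
    if not word_ids:
        return []
    # Walk the first posting list in its pre-sorted order and keep only
    # documents that also occur in every other term's posting list.
    other_lists = [set(INDEX[word_id]) for word_id in word_ids[1:]]
    matching = [
        doc_no
        for doc_no in INDEX[word_ids[0]]
        if all(doc_no in others for others in other_lists)
    ]
    # Fetch the original documents from the repository.
    return [REPOSITORY[doc_no] for doc_no in matching]


print(search("apple"))
print(search("apple pie"))
```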

1.6.                    Standards Compatibility

Clusterpoint Server is designed to comply with the following standards:

Standard      Reference
XML 1.0       http://www.w3.org/TR/REC-xml
UTF-8         RFC 2279: UTF-8, a transformation format of ISO 10646
HTTP          Hypertext Transfer Protocol
XPath 1.0     http://www.w3.org/TR/xpath

1.7.                    Features 

The following table lists the Clusterpoint Server features and the section of this guide that describes each one:

Title                     Section
FTS                       Search
Relevance                 Relevance
Multi-language support    Multi-language Support and Character Encoding
Case support              Search
Boolean expressions       Boolean Expressions
Stemming                  Stemming
Wildcard search           Wildcard Patterns
Fuzzy search              Alternatives
Markup search             Search within Markup

 

2.   Understanding Clusterpoint Server Document Structure

This section describes Clusterpoint Server document structure concepts and explains the strategies to use when the source data you want to import into the Clusterpoint Server system are unstructured and when they are XML structured. It also describes document ordering and grouping according to Clusterpoint Information Ranking, as well as language and text encoding concepts.

This is probably the most important section of the whole Developer’s Guide. 

Please read it very carefully, and if you have difficulty grasping our concepts, which, we know, differ radically from the relational database world, we encourage you to read this section again.

If you still have difficulty understanding any of the concepts, please do not hesitate to contact us at support  << @ >> clusterpoint.com.

This section contains the following topics:

·         Overview

·         Creating Document Structure with Application

·         Importing XML Data with Custom Structure

·         Document Ordering according to Clusterpoint Information Ranking and Result Set

2.1.                    Overview

As mentioned previously, any data can be stored in the Clusterpoint Server system and then retrieved using FTS queries (simple Internet-style ad hoc queries with any query terms known to the user). Data are stored in the Clusterpoint storage as customer-defined, schema-less XML documents. A Clusterpoint Server document is the smallest unit in the Clusterpoint storage against which searching is performed. When a search request is submitted, the Clusterpoint Server searches within the Clusterpoint storage and finds all documents that match the query.

Abstracting from specific content, format, and structure, we assume that data in existing corporate filings, databases, or storages can be perceived as documents that each have a unique document ID, a title, and content consisting of textual (human-readable) and possibly XML marked-up information. This is the data on which database search, including rich enterprise full text search (FTS) functionality, can be performed in the Clusterpoint architecture.

A document ID can be a simple integer, an alphanumeric character string, a full file path and file name on a file server, a URL of a Web page, or any other element that uniquely identifies a document. It should be a unique string for each document within a cluster storage (a named storage spanning multiple servers): only one cluster node can contain a document with a particular ID.

Often there are also other elements; however, we will talk about them later. Also we assume that when performing a search request, what a user expects to have as a reply is a list of IDs, titles and short descriptions of those documents, which match the search request.

The Clusterpoint Server system supports these assumed default elements for importing and retrieving data. This saves a lot of time when users perform ad hoc FTS without any predefined search forms or pre-programmed API calls. User-friendly FTS search queries always return the assumed default data, so no application-level programming is required.

As in relational databases, you can customize the list of XML items returned for each search query. This is done through the Document Policy file, which is a very small XML configuration file. Using the Document policy, customers can define, for each storage, the items to be listed for simple search queries and many other default parameters, reducing application-level programming for their database search applications. More details about the Document policy configuration file are given below.

The following sections describe how documents are imported in the Clusterpoint Server system if they are not XML structured and if they are XML structured.

2.2.                    Creating Document Structure with Application

When data are imported into the Clusterpoint storage using the Clusterpoint API functions, the default document elements (ID, title, and content) are passed to the Clusterpoint Server as parameters of the respective functions. When the respective Clusterpoint API command for storing a document is called, the elements of the document are enclosed in XML tags and sent to the Clusterpoint storage.

The following figure illustrates this process:

Figure 10: Storing data in the Clusterpoint storage

The following table lists and describes all default elements of a typical Clusterpoint Server document. The elements specified between the tags <document> ... </document> serve as the Clusterpoint Server document data definition, which tells the database engine how to index and search documents of this structure.

Element     Description

<id>        Unique document identifier in which FTS is not performed.

<title>     Document title in which FTS can be performed.

<rate>      Value of the integer type, in which FTS is not performed, assigned to a document with respect to other documents. When performing a search request, search results are ordered by rate if they are not ordered by relevance. For more information on document ordering, see Document Ordering in Result Set.

<group>     Document group. This element can be used to denote the domain of a Web document or another grouping tag, and it can also be used as a classifier for any kind of document. When performing a search, it is possible to limit the number of documents from one group in the search result.

<text>      Textual information in which FTS can be performed. Clusterpoint Server also supports XML marked-up information and preserves the markup when searching in it. A snippet, which is a fragment containing an occurrence of the search term, is returned in the search results.

<hidden>    Textual information in which FTS can be performed, but for which a snippet is not returned in the search results.

<info>      Additional information added to a document, but in which FTS is not performed, for example, picture files, MS Word or PDF document files, and so on. Note that these files must be appropriately formatted. For information on appropriate formatting, see Formatting XML Special Characters.

Extracting and defining these elements from source data before importing the data to the Clusterpoint Server system is an application task.
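As an illustration of this application task, the Python sketch below wraps extracted values in the default element names listed in the table. The function name build_document is hypothetical, the order of the elements is an assumption, and the output is only the <document> part: the surrounding Clusterpoint API command is not shown in this section and is therefore left out.

```python
import xml.etree.ElementTree as ET


def build_document(doc_id, title, text, rate=None, group=None):
    """Wrap extracted values in the default document elements."""
    doc = ET.Element("document")
    ET.SubElement(doc, "id").text = doc_id
    ET.SubElement(doc, "title").text = title
    if rate is not None:
        ET.SubElement(doc, "rate").text = str(rate)
    if group is not None:
        ET.SubElement(doc, "group").text = group
    ET.SubElement(doc, "text").text = text
    # ElementTree escapes XML special characters such as "<" and "&".
    return ET.tostring(doc, encoding="unicode")


print(build_document("42", "Hello", "a < b"))
print(build_document("43", "News", "Body text", rate=7, group="example.com"))
```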

Please note that the names of the XML data elements <id>, <title>, <rate>, <group>, <text>, <hidden> and <info> are purely our own and are shown for the purposes of this guide only. When you create a new storage using Clusterpoint Manager, a new default Document policy configuration file is also created, containing the default XML elements described above.

You are free to choose any other names of elements for your own database XML document, especially, if you already have some database with your own naming schema.  For example, you can define as document ID your custom tag <url>, or instead of <rate> you can use for the same purpose your own <timestamp>, <votes> or <date> tag. 

The Document Policy file enables you to re-assign the same functionality from our default names to any other named tags. The default names are used only in this guide, to help you understand how the Clusterpoint database platform works.

2.3.                    Importing XML Data with Custom Structure

If source data is XML structured, it is not necessary to restructure it to the default Clusterpoint Server document structure described in the previous section and Figure 10.

All XML tag names in custom user documents can be named differently. 

The Clusterpoint Server has a special configuration mechanism that tells the database engine what instructions to perform when processing specified XML fields, which this manual calls parts of the document.

Clusterpoint Server uses this custom document structure definition mechanism, called Document policy (it is a small configuration file located in the same  directory as your storage data), to define the location and behavior for each document part. 

A part of the document is any opening XML tag and the closing XML tag of the same name, together with the user content they enclose. Simply put, a document part is an XML tag and all the content it contains (between the opening and closing tags).

Before you can store XML-structured source data in a Clusterpoint storage, the Document policy configuration for that storage must be defined. The Clusterpoint Manager application retrieves the existing policy and defines a new policy for each Clusterpoint storage when you create storages and configure their document policies.

A Document policy is the complete set of instructions that tell the Clusterpoint Server storage how to handle your original XML data during data import (indexing), search ranking and data retrieval.

Your original XML documents stored in the Clusterpoint database are not changed or modified. The Document policy configuration file affects only how Clusterpoint Server builds and uses the database index, the Clusterpoint Index. It contains the rules for indexing and defines how the Clusterpoint Server software behaves during search and data retrieval, so that you can customize it for your needs. It also assigns default values, which saves programming code in the application software.

Rules are defined for each part (each XML tag) of the document; they specify which instructions apply to the document part during indexing or search. Each rule can set different values for a particular document part. A single name=value setting is called a 'property'. For example, the property id=yes means that the content of this document part is treated as the document identifier. The property index=all means that the content is indexed as text and, in addition, as text with preserved XML markup, which enables a later filtered search within this XML field only. The property index=text means that the content is indexed only as text, which saves disk space. The Clusterpoint database architecture has many property values, and we constantly expand them to meet new customer needs.
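As a minimal sketch of the name=value form described above, a property string can be split into its name and value in Python. The function name parse_property is hypothetical and is not part of any Clusterpoint API; it only illustrates the notation.

```python
def parse_property(text):
    """Split a 'name=value' property string into a (name, value) pair.

    Hypothetical helper for illustration only; not a Clusterpoint API.
    """
    name, sep, value = text.strip().partition("=")
    if not sep or not name or not value:
        raise ValueError(f"not a name=value property: {text!r}")
    return name, value


print(parse_property("id=yes"))      # ('id', 'yes')
print(parse_property("index=all"))   # ('index', 'all')
```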

The following table lists all policy properties with their values. Where a property has a default value, it is marked "(default)"; this is the value that is set if the property is not specified for the document part.

Property

Value

Description

id

no (default)

Information within this part is not treated as the identifier of the document.

The policy is not applied to this document part.

yes

Information within this part will be considered as identifier of the document (i.e. the primary key in legacy SQL terminology).

 

An ID can be a simple integer, an alphanumeric character string, a full file path and file name on a file server, the URL of a Web page, or anything else that uniquely identifies a document.

 

There must be exactly one ID for a document and duplicate IDs may not exist per named storage (per cluster storage or per single storage).

rate

no (default)

Information within this part is not treated as the rate of the document.

yes

An integer number within this part is treated as the rate of the document. It defines the relative ranking of an XML document against the other XML documents in the storage, including a cluster storage, and is used at search for sorting and grouping documents in the output. It can be generated by any customer-defined algorithm or formula that reflects the customer's business needs and database-specific search requirements. Values in this part can range from 0 to 2^32 - 1 (4294967295). Values need not be unique: ranking is relative.

group

no (default)

Information within this part does not denote a grouping classifier of the document (a value by which output per group can be limited).

yes

Information within this part denotes a classifier for any kind of document (e.g., the domain of a Web page), used to limit output per group.

index

no (default)

Information within this part is stored in the document repository and is available for retrieval, but it is not indexed in the Clusterpoint Index.

text

Textual information contained within this part is added to the Clusterpoint Index and made available for FTS.

xml

Textual information contained within this part is added to the Clusterpoint Index with its XML markup preserved. In this case FTS is performed according to the XML markup.

all

Both of the above (xml&text) apply to this document part. This mode consumes more memory and lengthens indexing time.

xml-text

Information within this part will be stored in the document repository and available for retrieval for textual content of sub-level of all child tags. This is useful if, for example, <address> is split into many subtags such as <city> and <street>, and you want to search across all of them with OR logic at this sub-level.

Permits grouping of the child tag values of an XML document part under a single search path when performing search within markup. This can replace multiple OR operations with a fast single default ad hoc query search.

facet

 

This index type is used for categorizing documents in a hierarchy, for example a directory structure. The data can later be accessed using XPath expressions relative to this part. Only one part per document can be set as the facet index. For more information, see the chapter on XML drilldown (faceted search).

 

xml-text&xml&text&facet

Switches on all modes of index policy.

 

You can select exactly the required indexing modes by joining them with the symbol '&'.

alias

any valid XML element tag name that is “virtual” (not present in customer XML document)

If an alias is defined for a document part, the index also records the contents of this part as if they were located in a virtual XML element named after the alias. Multiple XML parts can have the same alias element, therefore creating AND operation at search. Aliases can be used when performing search within markup.

Information within this part is added to the <virtual-tag-name> part, which does not exist in the original customer document but is created during indexing for consolidated search across different original tags at different levels of XML nesting. The values of all document parts with the same ‘alias’ name are joined at the index level into a virtual text string for <virtual-tag-name>, with a blank space character as the delimiter.

Useful for consolidating data on particular subjects, such as persons or addresses, at the index level for meaningful search, and for performing search queries only within those virtual tags. It does not require adding ‘technical’ tags such as <hidden>, described in the Clusterpoint Server Version 1.0 Developer’s Guide, to the customer data structure, and so avoids adding complexity to the customer XML document structure for technical reasons.

For example, all address elements in the database (people, company, workplace etc.) can be combined into a single virtual alias tag, creating an index searchable through ad hoc search for all addresses occurring in the database at different levels of nesting and for different data objects.

tag1&tag2

To set multiple alias tags, use the symbol '&amp' as separator. Sometimes a single XML tag value needs to be present in several different alias tags (virtual XML tags that are not present in the data object but are created during indexing at the Clusterpoint Index level for rich enterprise search needs), for example to enable relevant ad hoc search (full text search) in only a few selected groups of database items, filtering out all other parts. A database can have multiple such alias groups, enabling flexible customization for various ad hoc search needs. Multiple aliases can be selected and used when performing search within markup.

weight

<min–max>

This policy works only together with the index policy values text, xml, or all. The range is from 1 to 100. All words contained in this part are explicitly weighted for relevance to the corresponding search term when performing FTS, using textual ranking as in enterprise search engines. If min and max are defined as an interval, query terms matching text in this document part are additionally ranked by content-matching relevancy (e.g., terms that match closer together in the text are ranked higher within the specified min-max interval than terms that are farther apart). If only a single weight number is set, min and max are both equal to it and textual ranking at search is switched off. A single weight value defines a fixed XML structure ranking for a document part relative to other parts.

list

no (default)

Information within this part is not listed in the search results.

yes

Information within this part will be listed in the search results.

highlight

Information within this part is listed in the search results, and the search terms within this part are highlighted.

snippet

Only a snippet of this part is shown in the search results. The search terms within this part are highlighted.

index-numbers

no

Numbers found in this part are not stored or indexed for numeric range search and sorting. Overrides the default storage configuration parameter.

 

yes

Numbers are indexed independently of text in this part for numeric range search and sorting. This part must contain a numeric value. Overrides the default storage configuration; the number is treated as a float.

A standard range-based index is created for this part of the document, assuming float values. It enables interval querying (‘min..max’) and additional sorting of results at search in ascending, descending and geospatial order.

 

int

Numbers are indexed independently of text in this part for numeric range search and sorting. This part must contain a numeric value. Overrides the default storage configuration; the number is treated as an integer.

A standard range-based index is created for this part of the document, assuming integer values. It enables interval querying (‘min..max’) and additional sorting of results at search in ascending, descending and geospatial order.

index-dates

no (default)

Information within this part is not additionally indexed as a standard date (timestamp) range index.

yes

Information within this part is additionally indexed as a standard date (timestamp) range index for date range search ‘from..to’.

Dates are indexed independently of text in this part. If there is more than one date in this element, only the last date is indexed. The available formats are (please contact us for other formats):
YYYY/MM/DD [HH:MM:SS [am/pm]]
MM/DD/YY[YY] [HH:MM:SS [am/pm]]
DD Month YYYY [HH:MM:SS [am/pm]]
Month DD YYYY [HH:MM:SS [am/pm]]

exact-match

binary

The contents of the tag are indexed byte-to-byte for exact matching purposes.  Information within this part must be treated as is, for search queries, including white spaces etc.  Useful to search exact value string, for example, content from the beginning of the document part.

 

text

The contents of the tag are indexed as a set of words for exact matching purposes. Punctuation and other marks are ignored. Matching is case-insensitive. Information within this part is treated as text in lowercase, with leading and trailing white space trimmed.

 

all

Combination of both modes (binary&text). An API option selects one or the other during a search query.

 

none (default)

The tag will not be indexed for exact match

stem-lang

en

Defines that English language stemming rules apply to this part of the document during a search query, if stemming is requested through the API.

Configurable for other languages through a language-rule configuration utility that can be adapted for any other language.

 

fr, es, pt, it, ro, de, nl, sv, no, da, fi, hu, tr

Will stem in French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian and Turkish.

Other language stemming requests are welcome.

Please note that this language stemming mechanism works at the API level for any search term enclosed in ‘$’, as in $term$, automatically expanding all such search query terms with OR according to the language stemming rules.

An alternative way to provide good enough stemming for languages with varying word endings is to use string template queries such as ‘stemmin*’. These work on a statistical coverage basis, using the unique words actually indexed in the Vocabulary instead of strict language-specific stemming rules. In most cases users can use this simpler and more understandable string template syntax for ad hoc search, without resorting to strict language stemming rules.

 

none (default)

The contents of the tag will not be available for stemmed search.
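The table above states that index modes can be joined with '&' and that 'all' stands for xml and text together. The following Python sketch expands an index property value into the set of modes it names. The function name index_modes is hypothetical, and accepting 'all' as one of several '&'-joined tokens is an assumption, not documented Clusterpoint behaviour.

```python
# Index modes named in the table above ('no' and 'all' are handled separately).
INDEX_MODES = {"text", "xml", "xml-text", "facet"}


def index_modes(value):
    """Return the set of index modes named by an index property value.

    Hypothetical helper for illustration only; not a Clusterpoint API.
    """
    if value == "no":
        return set()
    modes = set()
    for token in value.split("&"):
        if token == "all":
            # 'all' is described as the combination of xml and text.
            modes |= {"xml", "text"}
        elif token in INDEX_MODES:
            modes.add(token)
        else:
            raise ValueError(f"unknown index mode: {token!r}")
    return modes


print(sorted(index_modes("all")))         # ['text', 'xml']
print(sorted(index_modes("xml&facet")))   # ['facet', 'xml']
```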

 

The storage policy configuration example below illustrates how the policy works. There is a set of 'rules' for each part of the document, and each part is identified in XPath notation in the policy file. Each rule specifies one or more 'property' tags for the respective part of the document, with values as described in the table above.

 

The example below defines a custom storage policy configuration for a document with the XML parts 'id', 'title', 'rate', 'domain' and 'text' (a typical document for Internet data storage).

 

<policy>
  <rule>
    <xpath>//document</xpath>
    <property>document=yes</property>
  </rule>
  <!-- Unique document identifier, listed in search results -->
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <!-- Title: indexed as text with the highest weight interval -->
  <rule>
    <xpath>//document/title</xpath>
    <property>index=text</property>
    <property>weight=90-100</property>
    <property>list=highlight</property>
  </rule>
  <!-- Document rate used for ordering -->
  <rule>
    <xpath>//document/rate</xpath>
    <property>rate=yes</property>
    <property>list=yes</property>
  </rule>
  <!-- Grouping classifier that limits output per group -->
  <rule>
    <xpath>//document/domain</xpath>
    <property>list=yes</property>
    <property>group=yes</property>
  </rule>
  <!-- Body text: lower weight interval, shown as a snippet -->
  <rule>
    <xpath>//document/text</xpath>
    <property>index=text</property>
    <property>weight=10-89</property>
    <property>list=snippet</property>
  </rule>
</policy>
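To make the rule and property structure concrete, the following Python sketch reads a shortened copy of such a policy with the standard library XML parser and collects the properties of each rule. It assumes only the element names shown in the example (policy, rule, xpath, property). The function name load_policy is hypothetical, and this is not how Clusterpoint Server itself processes the file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example policy, used as sample input.
POLICY_XML = """
<policy>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/title</xpath>
    <property>index=text</property>
    <property>weight=90-100</property>
    <property>list=highlight</property>
  </rule>
</policy>
"""


def load_policy(xml_text):
    """Map each rule's XPath to a dict of its property names and values."""
    rules = {}
    for rule in ET.fromstring(xml_text).findall("rule"):
        xpath = rule.findtext("xpath").strip()
        props = {}
        for prop in rule.findall("property"):
            name, _, value = prop.text.strip().partition("=")
            props[name] = value
        rules[xpath] = props
    return rules


policy = load_policy(POLICY_XML)
# Find the part that serves as the document identifier.
id_parts = [xpath for xpath, props in policy.items() if props.get("id") == "yes"]
print(id_parts)  # ['//document/id']
```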

 

The policy defines how the search engine responds when indexing or searching a document storage whose data contains the document parts covered by the rules. The policy rules define which part serves as the unique document identifier (id=yes), which parts are listed in every search result (list=yes), and which parts are indexed as text (index=text).

In the example above, the policy property list=snippet for the document part 'text' also defines that a text snippet is generated for this part instead of the actual content. The policy for the part 'domain' (group=yes) defines that results with the same value are grouped and only the first few results are shown in the search results list. The policy rules also define that the relevance of all words found in the document part 'title' is higher (weight=90-100) than that of words found in the document part 'text' (weight=10-89).

 

The example above shows a very simple policy definition. An actual configuration for an Internet document storage can be sophisticated, with tens of rules and hundreds of properties, depending on end user needs and application specifics.

The (policy, rules, property values) schema is a simple and efficient way to flexibly accommodate all kinds of application needs for custom indexing and relevancy.

You can build storages that accommodate specialty search needs for a combination of text, XML and numeric data search; design multi-zone relevance definitions for documents to improve the quality of search results for your end users; define documents with a special structure in which some parts are not indexed; and manage the size of the index and performance by fine-tuning indexed and non-indexed parts.

Clusterpoint adds new Document policy property values along with the new features supported by each new version of the database server core software.

Please check for new policy property values whenever a new version of the Clusterpoint Server software is downloaded or installed.

 

2.4.                    Document Ordering According to Clusterpoint Information Ranking in Result Set

This section describes how documents are ordered in a result set. It describes the ranking mechanisms conceptually and contains the following topics:

·         Overview

·         Rate (document ranking among themselves)

·         Relevance (XML data structure ranking)

·         Textual ranking (full text search query ranking, a subset of Relevance ranking)

 

2.4.1.  Overview

The Clusterpoint system has three basic information ranking mechanisms that determine how documents are ordered in a search result set:

·         By the Document <rate> value, which the application must assign to each document before storing it in the Clusterpoint storage and which is independent of the search request. It can be any customizable value or the output of any algorithm, and it ranks documents against each other when all other rankings are equal;

·         By the fixed relevance that the customer assigns to specific parts of the XML structure. It is calculated when performing a search and ensures that documents matching the search request in a higher-ranked part are displayed and grouped first in the result set. It lets you rank any XML structure according to your own needs;

·         By the relevance of the textual content to a particular search query. Enterprise search (full text search) content-matching rules find the documents that match the search query most closely, depending on the frequency and position of the search terms within the textual content of the particular XML part.

The Document rate ensures high performance of the search function, the XML-structure relevance ensures the quality of FTS in structured (XML) search results, and the textual relevance ensures the quality of textual content matching.

The XML-structure relevance (fixed relevancy) and the textual relevance (a floating interval of relevancies) are designed to combine into one overall search relevancy score, so results can be ordered with high quality for the semi-structured data that is most common on the Internet and the Web today.

Sorting by overall relevancy score is a default mechanism that is applied every time a simple Internet-style ad hoc search is performed.

Sorting only by document rate is an option that you can choose additionally when a search is performed.

The decrease of performance due to the sorting by rate is minimal.

The rate and relevance mechanisms are illustrated by an example in the following figure:

Figure 11: Document rate and relevance

The query contains the search function that must return documents containing the word 'houses'.

Each document in the Clusterpoint storage has a Document rate assigned: document A has rate=5000, and document B has rate=3000.

The following table presents the sequence of the two documents in the result set, when the search function uses the relevance for document ordering, and when it does not, in other words, when the relevance is on and when the relevance is off.

Document and its rate     Relevance off   Relevance on
Document A rate=5000      1               2
Document B rate=3000      2               1

When searching with the relevance off, only the Document rate is considered, and, therefore, documents with higher rates are displayed first. In the example, document A has a higher rate than document B, and, therefore, document A is displayed first.

When searching with the relevance on, the place where the search term appears in the document is considered. Documents that contain the search term in parts that are more important than other parts, in other words, parts with a higher specific weight, are displayed first. In the example, document A contains the search term in its text part, whereas document B contains the search term in the document title, which has a higher specific weight than the text. Therefore, document B is displayed first.
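The two orderings in this example can be sketched in Python as follows. The part weights (100 for the title, 50 for the text) and the field names are hypothetical values chosen for illustration; only the rates 5000 and 3000 come from the example. The sketch models the ordering rule, not the Clusterpoint Server implementation.

```python
# Hypothetical part weights; the title outweighs the text, as in the example.
PART_WEIGHT = {"title": 100, "text": 50}

# Documents A and B with the rates from the example and the part that matched.
docs = [
    {"name": "A", "rate": 5000, "hit_part": "text"},
    {"name": "B", "rate": 3000, "hit_part": "title"},
]


def order(documents, relevance_on):
    """Order by rate alone, or by relevance first with rate as tie-breaker."""
    if relevance_on:
        key = lambda d: (PART_WEIGHT[d["hit_part"]], d["rate"])
    else:
        key = lambda d: d["rate"]
    return [d["name"] for d in sorted(documents, key=key, reverse=True)]


print(order(docs, relevance_on=False))  # ['A', 'B']
print(order(docs, relevance_on=True))   # ['B', 'A']
```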

2.4.2.  Rate

The rate is an integer in the range from 0 to 4294967295 (2^32 - 1), which the application must assign to each document when storing it in the Clusterpoint storage.

The rate allows significant optimizations for large data amounts, which ensures high performance of the Clusterpoint Server system in massively clustered systems.

It is the application developer's task to create an effective ranking algorithm that assigns Document rates to document collections in a way that satisfies user needs and business rules, for example, by alphabetic order, by document publication or creation date, by objective document importance, or by customer project importance.
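As one possible illustration of such an algorithm, the Python sketch below derives the rate from the publication date, so that newer documents receive a higher rate. The function name rate_from_date is hypothetical, and both the choice of whole seconds since 1970 and the treatment of input dates as UTC are assumptions made for this example. Any formula that yields an integer within the allowed range would serve equally well.

```python
from datetime import datetime, timezone

MAX_RATE = 2**32 - 1  # upper bound of the rate range stated above


def rate_from_date(published):
    """Use the publication time in whole seconds since 1970 as the rate.

    Hypothetical helper: naive datetimes are assumed to be UTC, and the
    result is clamped to the allowed rate range 0..MAX_RATE.
    """
    seconds = int(published.replace(tzinfo=timezone.utc).timestamp())
    return max(0, min(seconds, MAX_RATE))


newer = rate_from_date(datetime(2011, 6, 1))
older = rate_from_date(datetime(2010, 6, 1))
print(newer > older)  # True
```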

If the Document rate is not assigned or if there are documents with the same Document rate, the default document order in a result set is a reverse of the document storing sequence to the Clusterpoint storage.

In a named Clusterpoint storage, multiple Document rate parts can be assigned, and multiple assigning algorithms can be used.  At API level you can specify which rate must be used for ordering.  XML parts used for additional (second, third etc.) ordering must be indexed with Document policy ‘index-numbers’ or ‘index-dates’, and specifically requested for this ordering during search by API.

If your application requires several ordering types for the same document collection, you must define a separate rate-assigning algorithm for each rate tag.

Technically, assigning the Document rate to documents is setting an integer value for the <rate> element. For more information on the Clusterpoint Server document structure, see Creating Document Structure with Application.

2.4.3.  Relevance

The relevance is an integer that measures the accuracy of the search results. It is calculated according to:

1)      the fixed relevancy weight values that you define for the parts of your custom documents (you rank your XML data structure, telling Clusterpoint Server which parts are more important than others when search terms hit those parts);

2)      the relevancy weight interval (min..max) that you assign to those textual document parts where you want to apply classic enterprise search rules for better matching of textual content to search queries. Here the overall relevancy depends on the search query matches and the actual document content: the number of times the search term appears compared to other documents, the distance between the search terms in the document if multiple words are searched, and the position of the search terms in the document.

Please note that fixed relevancy weights and relevancy weight intervals rank document parts relative to each other and can be freely combined according to your own specific search needs. A document part with a higher specific weight interval than other document parts with fixed ranking is considered more important than those parts. For example, the document title is more important than the document text if either of the two relevance ranking methods described above produces a higher overall relevancy score for the title.

In the Clusterpoint Server system, there is a common overall relevance calculation algorithm, which is implemented according to the two basic relevance ranking mechanisms described above in this section.

The specific weight interval can be customized to best reflect the textual parts of your document structure and give your users the search experience they expect. The fixed weights can be used for very exact grouping and positioning of search results, for example in paid search advertisement applications, where exact positioning is of paramount importance. With fixed relevancies you can divide your search result set into up to 100 relevance groups, and the search results within each group can be additionally sorted by document rate, providing very precisely ranked search results for your web applications.

For more information on the Clusterpoint Server relevance calculation algorithm, see Relevance Calculation Algorithm.

For more information on setting your own specific weight, see Customizing Specific Weight Interval.

2.4.3.1.      Relevance Calculation Algorithm

This section describes general principles of the Clusterpoint Server overall relevance calculation algorithm.

Note:      This section contains some of Clusterpoint Server system implementation details. Description provided in this section is very general and does not include implementation details for all Clusterpoint Server functionality.

The Clusterpoint Server overall relevance calculation algorithm consists of two parts that are performed when:

·         storing and indexing documents to the Clusterpoint storage

·         searching documents in the Clusterpoint storage

Steps of the Clusterpoint Server overall relevance calculation algorithm are described generally. To ensure a better understanding of the algorithm, an example is also provided. Each step is followed by the example part that reflects the step.

1.       When storing documents to the Clusterpoint storage, the specific weight of each word in a document is calculated as follows (by ‘word’ we mean any “atomic” element such as a text word, string or email address):

1.1    In each document part, the specific weight is calculated for each word according to the fixed relevancy weight or relevancy weight interval (min-max) of the document part in which the word occurs (assuming the content is textual).

The specific weight for a word in a document part is the minimum value of the following:

·         minimum value of the specific weight interval of the document part plus a number of times the word occurs in the document part

·         maximum value of the specific weight interval of the document part

Note:       The minimum and maximum of the specific weight interval can be the same fixed value. In that case all words in the document part, no matter how often they appear, have the same specific weight: the specific weight value of the document part. In essence, this reduces to simple XML structure ranking with fixed values.

Example:

A document consists of three document parts: heading, description, and note. The document parts contain the words w1, w2, and w3, and each part has its own specific weight interval, as described in the following figure:

Figure 12: Calculating specific weight for each document

w1(heading)=min(80,80)=80, w1(description)=min(20+1,50)=21, w1(note)=min(10+4,12)=12

w2(heading)=0, w2(description)=min(20+3,50)=23, w2(note)=min(10+1,12)=11

w3(heading)=0, w3(description)=min(20+1,50)=21, w3(note)=min(10+2,12)=12

1.2    The maximum value of specific weights of a word in all document parts is assigned as the specific weight of the word in the document.

Example (continued):

max(w1(heading), w1(description), w1(note))=80

max(w2(heading), w2(description), w2(note))=23

max(w3(heading), w3(description), w3(note))=21

2.       When searching documents in the Clusterpoint storage, the relevance of the document to the search request is calculated as follows:

2.1    Specific weights of all search terms in a document are summed.

Example (continued):

Σ(w1, w2, w3) = max(w1(heading), w1(description), w1(note)) + max(w2(heading), w2(description), w2(note)) + max(w3(heading), w3(description), w3(note)) = 124

2.2    The relevance is calculated by multiplying the sum from the previous step by a value d that takes into account the distance between the search terms in the document: the greater the distance, the smaller the value.

Example (continued):

Relevance = Σ(w1, w2, w3) * d
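The steps above can be sketched in Python using the numbers from the example. The weight intervals and the resulting values follow the example; the occurrence count of w1 in the heading is assumed to be 1, because the figure is not reproduced here, and the distance factor d is passed in as a parameter because its formula is not given in this section. The function names are hypothetical, and the sketch reflects only the general description above, not the actual server implementation.

```python
# Weight intervals (min, max) per document part, as in the example.
INTERVALS = {"heading": (80, 80), "description": (20, 50), "note": (10, 12)}

# Occurrence counts per word and part. The heading count for w1 is an
# assumption (any positive count gives the same result for a fixed weight).
COUNTS = {
    "w1": {"heading": 1, "description": 1, "note": 4},
    "w2": {"heading": 0, "description": 3, "note": 1},
    "w3": {"heading": 0, "description": 1, "note": 2},
}


def part_weight(count, interval):
    """Step 1.1: min(minimum + occurrences, maximum); 0 if the word is absent."""
    low, high = interval
    return min(low + count, high) if count else 0


def word_weight(counts):
    """Step 1.2: the largest weight of the word over all document parts."""
    return max(part_weight(n, INTERVALS[part]) for part, n in counts.items())


def relevance(terms, distance_factor):
    """Steps 2.1 and 2.2: sum the word weights, then scale by the factor d."""
    return sum(word_weight(COUNTS[t]) for t in terms) * distance_factor


weights = {word: word_weight(counts) for word, counts in COUNTS.items()}
print(weights)                            # {'w1': 80, 'w2': 23, 'w3': 21}
print(relevance(["w1", "w2", "w3"], 1))   # 124
```

With a distance factor of 1 the result equals the sum from step 2.1; a smaller factor models search terms that are farther apart.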

2.4.3.2.      Customizing Specific Weight Interval, incl. Textual ranking

This section describes how to set the specific weights or weight intervals for document parts that best reflect your XML document structure (XML structure ranking).

As described in the previous section, the specific weight of a document part is either a single integer or an interval between two integers.

By default, the following specific weight intervals are defined:

Document part        Minimum   Maximum
Title                100       100
All except Title     1         99

The Title has a fixed relevancy weight, where the minimum and maximum values are the same: 100.

The other parts have a textual ranking relevancy weight interval of 1-99, which produces an overall relevancy within this range during search.

You can set a different value for the Title part, and you can define a separate specific weight interval for each document part, such as Text and Hidden or any other document parts that you have, to ensure a more detailed relevance calculation.

For performance reasons, the maximum specific weight value is limited to 100.
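A weight value as written in a Document policy property, either a single number or a min-max interval, can be checked against these limits with a small Python sketch. The function name parse_weight is hypothetical, and the lower bound of 1 is taken from the weight property description in the policy table.

```python
def parse_weight(value):
    """Parse 'N' or 'MIN-MAX' into a (minimum, maximum) pair within 1..100.

    Hypothetical helper for illustration only; not a Clusterpoint API.
    """
    low, sep, high = value.partition("-")
    minimum = int(low)
    # A single number means a fixed weight: minimum and maximum are equal.
    maximum = int(high) if sep else minimum
    if not 1 <= minimum <= maximum <= 100:
        raise ValueError(f"weight out of range: {value!r}")
    return minimum, maximum


print(parse_weight("90-100"))  # (90, 100)
print(parse_weight("100"))     # (100, 100)
```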

2.4.3.3.      Results Grouping and Ordering according to Clusterpoint Information Ranking

You may think of relevance ranking as the relative importance, in %, of your custom XML document parts: assign a higher % to those parts where you want search hits to order your results higher. In the example above, search hits in the Title have 100% relevance and are always grouped first, while text relevancy is calculated by interval-based ranking of the textual content and is positioned lower than matches in the Title.

In this way you can efficiently group and order search results into up to 100 groups with unique relevancies, mixing and matching relevancies defined by fixed XML structure ranking with relevancies from textual content ranking.

Whenever the overall calculated relevancy at search is equal (for example, there are millions of matching documents containing a search term in the Title, all with an identical overall relevancy of 100% as in our example), the results within the same relevancy group are additionally ordered by Document rate. This is the foundation of the Clusterpoint customizable information ranking system.

This makes it possible to uniquely rank more than 400 billion data items in your database for exact positioning and ordering by the best relevance from the user's point of view, even if your FTS search query generates thousands or millions of matches. For example, if you search a worldwide Internet index simply for ‘London’, there may be 1.3 billion results (relevance from 1 to 99% in our sample), with some 74 million results having a search hit in the Title (relevance 100% in our sample). What do you do with that overwhelming amount of data?

Therefore, Clusterpoint Server software additionally sorts results by Document rate value within each group of equal relevancy. If your Document rate assigning algorithm, for example, counts the number of pages that link the text ‘london’ to a particular web page, the first results on your web page will most likely be the most popular London travel, city municipality and tourism resources. Search results obtained from a huge, seemingly overwhelming database suddenly become meaningful for end users.
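The two-level ordering and the "more than 400 billion" figure can be checked with a short Python sketch. The figure is read here as 100 relevance values multiplied by 2^32 possible rate values, which is an interpretation of the text rather than a statement from it. The sample hits and the function name rank are hypothetical.

```python
MAX_RELEVANCE_GROUPS = 100   # relevance values 1..100
RATE_VALUES = 2**32          # rate values 0..2**32 - 1

# Number of distinct (relevance, rate) positions under this interpretation.
print(MAX_RELEVANCE_GROUPS * RATE_VALUES)  # 429496729600

# Hypothetical search hits: (document id, relevance, rate).
hits = [("d1", 100, 120), ("d2", 57, 900), ("d3", 100, 4500), ("d4", 57, 15)]


def rank(results):
    """Sort by relevance first, then by rate inside each relevance group."""
    return sorted(results, key=lambda h: (h[1], h[2]), reverse=True)


print([doc_id for doc_id, _, _ in rank(hits)])  # ['d3', 'd1', 'd2', 'd4']
```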

You can design and develop your own customizable Clusterpoint Information ranking algorithms, which can be as easy or as complex as your business rules require. 

To customize specific relevancy weight or weigth intervals for XML document parts (XML data structure ranking with additional enterprise search ranking for textual content) you should define the Document policy rules for those parts assigning a fixed weight.  A relevancy weight interval you sould assign for only that textual content you would like to rank additionally using standard enterprise search engines methods.  Both types of information ranking rules can be assigned through Document Policy.  Add your own flexible Document rate calculation algorithm, and you would always enable to bring the most relevant data out of your database on the first web page.

This mechanism of Clusterpoint Information Ranking does not slow down database search, because the custom ranking is built into the pre-sorted Clusterpoint Index during the indexing and data storage phase.  There is no need to do massive sorting of information at search time.  The index has already been organized for mostly sequential disk access, according to your custom ranking rules, and search response times are always sub-second.

As soon as you store new documents in Clusterpoint Storage, or update existing ones, Clusterpoint Index is updated, applying information ranking to all index elements.  In other words, you can organize any custom Clusterpoint Index for fast and relevant search, based on your own information ranking algorithms and rules, instead of relying on someone else to organize your database information.  This indexing model is also massively scalable in a cluster (from a single server to thousands of servers, if necessary) and delivers sub-second search results from a large cluster with almost no performance loss.

With the Clusterpoint Manager Application, you retrieve the existing Document policy and define a new Document policy for each Clusterpoint storage when creating storages and configuring Document policy files for them.  For more information, see Clusterpoint Server User Guide, Creating and Configuring Storages.

To customize Document rate values, you have to develop your own algorithm that assigns an integer value to the <rate> tag, or select an existing XML tag to use for document ranking.

Combining all three Clusterpoint database information ranking methods described above (for XML structure, for textual content and for documents), you can achieve the maximum search performance and maximum user satisfaction searching your database.

Your users can start enjoying ultra-fast and user-friendly Internet-style ad hoc (FTS) search that delivers the most relevant data for web applications almost instantly, independently of the total database size and without performance loss in a clustered hardware setup.

There are also other benefits of using Clusterpoint database information ranking to improve your productivity.

In the Clusterpoint architecture, you can configure the best search relevancy rules for your database without any application software programming.

You can also eliminate integration with enterprise search tools in your application software, which often tries to achieve the desired relevance by ranking search queries in application code, or by building and maintaining complex integrated SQL and external full text search indexes.  Many legacy enterprise search systems struggle to work at high speed in clusters because they do not scale out well.

Clusterpoint Information Ranking can be customized at the Document Policy configuration file level.  Document rate algorithms and relevancy rules for a particular database, once assigned and fine-tuned, rarely need to be changed.

The information ranking facility built into the Clusterpoint Server software also makes a Clusterpoint database scalable in clusters: Clusterpoint Server software automatically creates a Clusterpoint Index ranked by your own database search rules, distributes the indexing workload among cluster nodes, and provides generic index scalability for search and replication within a cluster.

You do not need to address clustering logic in your application software: this is another great advantage of the Clusterpoint database architecture.  From an application software developer's point of view, any Clusterpoint database is just a single logical database storage of XML documents, no matter how many cluster servers hold the total database content.

3.   MULTI-LANGUAGE SUPPORT

This section describes multi-language support and character encoding concepts, provides examples for different character encoding cases, and explains XML formatting concepts.

This section contains the following topics:

·         Multi-language Support and Character Encoding

·         Formatting XML Special Characters

3.1.                    Multi-language Support and Character Encoding

The Clusterpoint API structure is based on XML, which means that all character encoding related issues adhere to XML internationalization standards.

For more information on XML internationalization standards, see http://www.w3.org/TR/REC-xml.

Clusterpoint Server provides complete multi-language support by automatically performing all necessary character encoding conversions.

You can import documents in different languages and encodings into a single Clusterpoint storage, and you can perform search queries in different languages and encodings against that storage.

This section contains the following topics:

·         Overview

·         Storing and Searching in a Single Encoding

·         Storing in Different Encodings and Searching in a Multiple Bytes per Character Encoding

·         Storing in Different Encodings and Searching in One Byte per Character Encoding

 

3.1.1.  Overview

Clusterpoint Server supports any language and text encoding for document importing and searching. When documents are imported to the Clusterpoint storage, they are all converted to and stored internally in the UTF-8 encoding, as illustrated in the following figure:

Figure 13: Importing documents with different encodings

In Figure 13, document encodings are represented as encoding values each in a white box with a double dotted border.

Data exchange between an application and the Clusterpoint Server software is performed in the XML format. In the XML format, data can be in any encoding; this encoding is defined in the XML header of the document.

All Clusterpoint API functions have an encoding parameter, which defines the encoding of textual data. This encoding is used in the XML header, when importing documents to the Clusterpoint storage as described in Creating Document Structure with Application. The textual data must comply with the encoding defined in a function parameter, or else the system returns a parsing error.
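
The requirement that textual data must match the declared encoding can be illustrated with Python's built-in codecs. This is only a sketch of the general principle. It does not show how Clusterpoint Server performs its checks or conversions, and the sample word is arbitrary.

```python
text = "garçon"
latin1_bytes = text.encode("iso-8859-1")  # 'ç' becomes the single byte 0xE7

# Declared encoding matches the bytes: decoding succeeds.
assert latin1_bytes.decode("iso-8859-1") == text

# Declared encoding does not match the bytes: decoding fails,
# which is comparable to the parsing error described above.
try:
    latin1_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

# Conversion to UTF-8, the encoding used for internal storage.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(mismatch_detected, len(latin1_bytes), len(utf8_bytes))
```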

The available encodings are limited only to those installed on the computer on which the Clusterpoint Server runs. To find out which encodings are installed on the Clusterpoint Server computer, see the Clusterpoint Server User Guide. For example, on Red Hat Linux, the US-ASCII, ISO-8859-1..13, WINDOWS-1250..1258, UTF-7, UTF-8, UTF-16, and UTF-32 encodings are usually installed.

Technically, only the encoding is important to the Clusterpoint Server system, which means that you can store and search data in the Clusterpoint Server system in any language as long as you supply a valid encoding for that language.

There are the following two types of encodings:

Title                           Description
one byte per character          Contains 256 characters, which means that a single encoding of this type can include the characters of several similar languages, for example, WINDOWS-1250 and ISO-8859.
multiple bytes per character    Contains all UCS (universal character set) characters, which include characters for almost all languages, for example, Greek, Cyrillic, Korean, and so on.

You can store documents in different languages with different encodings within a single Clusterpoint storage; documents are converted to the UTF-8 encoding, which contains all characters from UCS and, therefore, all characters are preserved correctly.

Search results are returned in the encoding that is used for the search request.

The following three sections contain examples with different cases of working with a single and several encodings, which demonstrate the Clusterpoint Server multi-language support.

3.1.2.  Storing and Searching in Single Encoding

This section contains an example, when a single encoding is used for document storing and retrieving.

The following figure illustrates the example:

Figure 14: Storing and searching in single encoding

In Figure 14, document encodings are represented as encoding values each in a white box with a double dotted border.

1)      All documents are imported to the Clusterpoint storage in the same encoding. In Figure 14, the encoding is ISO-8859-1 for French, which encodes French character 'ç' and other characters that are not included in a US-ASCII encoding.

2)      Users submit search queries to the Clusterpoint storage in the same encoding as the document source encoding.

3)      Search results are returned to a result set in the same encoding.

4)      The search results are displayed with correct characters to the user.

Note:      The user computer must have appropriate fonts installed for viewing that encoding. Older browser versions may not support the UTF-8 encoding and display the special characters as question marks ?. In that case, the browser must be updated.

3.1.3.  Storing in Different Encodings and Searching in Multiple Bytes per Character Encoding

This section contains an example, when different encodings are used for document storing and a multiple bytes per character encoding is used for retrieval.

The following figure illustrates the example:

Figure 15: Storing in different encodings and searching in multiple bytes per character encoding

In Figure 15, document encodings are represented as encoding values each in a white box with a double dotted border.

1)      Documents are imported to the Clusterpoint storage in different encodings. In Figure 15, the encodings are ISO-8859-1 for French and ISO-8859-15 for German.

2)      Users submit search queries to the Clusterpoint storage in a multiple bytes per character encoding; in Figure 15, the encoding is UTF-8.

3)      Search results are returned to a result set in the encoding that is used in the search request; in Figure 15, the encoding is UTF-8.

4)      The search results are displayed with correct characters to the user.

Because a multiple bytes per character encoding is used in this case, the characters of both languages are displayed without problems.

 

Note:      The user computer must have appropriate fonts installed for viewing that encoding. Older browser versions may not support the UTF-8 encoding and display the special characters as question marks ?. In that case, the browser must be updated.

3.1.4.  Storing in Different Encodings and Searching in One Byte per Character Encoding

This section contains an example, when different encodings are used for document storing and a one byte per character encoding is used for retrieval.

The following figure illustrates the example:

Figure 16: Storing in different encodings and searching in one byte per character encoding

In Figure 16, document encodings are represented as encoding values each in a white box with a double dotted border.

1)      Documents are imported to the Clusterpoint storage in different encodings. In Figure 16, the encodings are ISO-8859-1 for German and ISO-8859-15 for German with the Euro symbol '€'.

2)      Users submit search queries to the Clusterpoint storage in a one byte per character encoding; in Figure 16, the encoding is the old German code page.

3)      Search results are returned to a result set in the encoding that is used in the search request; in Figure 16, the encoding is the old German code page without the Euro symbol.

4)      The search results are displayed with correct characters to the user.

Characters that are not in the encoding are returned as XML entities.  For example, in Figure 16, Euro symbol '€' (if present in data) is returned as &#8364;.
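
The same substitution can be reproduced on the client side with Python's standard codecs, which is convenient when testing how an application handles such replies. The sample string is made up, and the sketch says nothing about how the server itself implements the replacement.

```python
price = "Preis: 10 €"

# ISO-8859-1 has no Euro sign, so the character is replaced
# by a numeric XML character reference.
latin1_reply = price.encode("iso-8859-1", errors="xmlcharrefreplace")

# ISO-8859-15 does contain the Euro sign (byte 0xA4), so no entity is needed.
latin9_reply = price.encode("iso-8859-15")

print(latin1_reply)
print(latin9_reply)
```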

For more information on XML entities, see http://www.w3.org/TR/REC-xml.

Note:      The user computer must have appropriate fonts installed for viewing that encoding. Older browser versions may not support the UTF-8 encoding and display the special characters as question marks ?. In that case, the browser must be updated.

3.2.                    Formatting XML Special Characters

As mentioned in earlier sections, data are sent from an application to the Clusterpoint storage in the XML format. Therefore, the data must comply with XML formatting rules. For example, the data cannot contain the XML special characters <, >, and &, which are used for XML markup; instead, &lt;, &gt;, and &amp; must be used, respectively.

Example:

If you have a title A&B, you must convert it to A&amp;B.
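
In application code, this conversion is usually delegated to a library rather than done by hand. A minimal Python sketch using the standard library follows; the sample strings are arbitrary.

```python
from xml.sax.saxutils import escape, unescape

title = "A&B"
escaped_title = escape(title)           # replaces &, < and >
escaped_text = escape("1 < 2 & 3 > 2")

print(escaped_title)
print(escaped_text)

# unescape() reverses the conversion when reading values back.
restored = unescape(escaped_title)
```

XML libraries that build documents from element objects typically apply this escaping automatically when serializing.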

For more information on the XML formatting rules, see http://www.w3.org/TR/REC-xml.

4.   Clusterpoint API Specification

This section describes the Clusterpoint API specification, which is implemented in XML.

This section contains the following topics:

·         Overview

·         Clusterpoint XML Message Envelope

·         Data Manipulation

·         Status Monitoring

·         Data Retrieval

·         Error Handling

4.1.                    Overview

This section contains the following topics:

·         Submitting Clusterpoint Server Commands and Receiving Replies

·         XML Message Structure

 

4.1.1.  Submitting Clusterpoint Server Commands and Receiving Replies

XML request and reply messages are exchanged between the application and the Clusterpoint storage over HTTP, using port 80 by default.

As mentioned earlier, you can send Clusterpoint Server commands and receive replies as XML messages, or you can submit commands as HTTP GET parameters and receive formatted replies.

Both options are described in the following sections:

·         Exchanging XML Messages Directly

·         Submitting Parameters and Receiving Formatted Replies

 

4.1.1.1.      Exchanging XML Messages Directly

The following figure illustrates submitting Clusterpoint Server commands and receiving replies via XML messages directly:

Figure 17: Exchanging XML messages directly

A request is sent as a POST method.

For HTTP resource identification, the URL http://host/cgi-bin/cpse/cpse-gw.cgi must be used, where host is the Clusterpoint Server host name.
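
A minimal Python sketch of preparing such a POST request is shown below. It only builds the request object and does not contact a server. The Content-Type header, the host localhost, and the user name and password values are assumptions for illustration; the guide does not specify them.

```python
import urllib.request

GATEWAY_PATH = "/cgi-bin/cpse/cpse-gw.cgi"  # path taken from this guide

def build_post_request(host, xml_message, encoding="utf-8"):
    """Prepare (but do not send) an HTTP POST carrying one XML request message."""
    return urllib.request.Request(
        f"http://{host}{GATEWAY_PATH}",
        data=xml_message.encode(encoding),
        # Assumption: the guide does not say which Content-Type is expected.
        headers={"Content-Type": f"text/xml; charset={encoding}"},
        method="POST",
    )

message = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<cpse:request xmlns:cpse="www.clusterpoint.com">'
    "<cpse:storage>test</cpse:storage>"
    "<cpse:command>status</cpse:command>"
    "<cpse:user>demo</cpse:user>"
    "<cpse:password>secret</cpse:password>"
    "<cpse:content/>"
    "</cpse:request>"
)
request = build_post_request("localhost", message)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read()
```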

4.1.1.2.      Submitting Parameters and Receiving Formatted Replies

The following figure illustrates submitting Clusterpoint Server commands as HTTP GET parameters and receiving formatted XML replies:

Figure 18: Submitting parameters and receiving formatted replies

A request is sent as a GET or POST method.

For HTTP resource identification, the URL http://host/cgi-bin/cpse/cpse-gw.cgi must be used, where host is the Clusterpoint Server host name. Command specific parameters must be included in the query string or passed as POST data.
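
When building the query string in code, parameter values should be URL-encoded so that characters such as spaces and & do not break the request. The following Python sketch shows one way to do this; the helper name build_command_url and the sample values are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_command_url(host, command, **params):
    """Return a gateway URL with the command and its parameters URL-encoded."""
    query = urlencode({"command": command, **params})
    return f"http://{host}/cgi-bin/cpse/cpse-gw.cgi?{query}"

url = build_command_url("localhost", "insert", storage="test", id=1, title="Tom & Jerry")
print(url)

# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
```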

4.1.2.  XML Message Structure

As described previously, each XML message contains a command name, content data that are specific for the command, and other information, such as user name and request identifier, which is common for all XML messages and included in the so called XML message envelope.

For more information on the XML message envelope, see Clusterpoint XML Message Envelope.

The following figure illustrates the common part for all XML messages and content part that is specific for each command:

Figure 19: XML message: common part and content part

Description of Clusterpoint API commands is organized so that the common part is described in Clusterpoint XML Message Envelope, and only the content parts are described for each command in separate sections named after the command.

XML elements are presented as they appear in messages and each XML element is described within its tags.

The command syntax consists of an XML request and an XML reply, and as mentioned, XML requests can be submitted as HTTP GET or POST parameters. To describe XML request, XML reply, and HTTP GET parameters syntax, each section contains the following subsections:

Subsection

Description

XML Request

Lists all XML request elements that are specific for the command as they appear in XML request messages. Each element is described within its tags. The description within the tags ends with an asterisk *, if the element is mandatory.

HTTP GET Parameters

Describes HTTP GET parameter syntax in the form of an example. The example looks as follows:

http://host/cgi-bin/cpse/cpse-gw.cgi?param1=value&param2=value

where:

·         host is Clusterpoint Server IP address or a host name,

·         param1, param2, and so on are HTTP GET parameters,

·         value is a parameter’s value.

Note:      The examples describe HTTP GET parameters; however, you can also submit POST parameters.

XML Reply

Lists all XML reply elements that are specific for the command as they appear in XML reply messages. Each element is described within its tags.

Some elements in XML requests (and thus the respective parameters, if the XML request is submitted as HTTP GET parameters) are mandatory, and some are not. The mandatory elements are marked with an asterisk * in the XML request description.

However, some XML request elements are mandatory only if submitted as an XML request, and are not mandatory if submitted as HTTP GET parameters. Such parameters must first be defined in the Clusterpoint Server Web server module configuration file; they then do not have to be submitted each time a command is sent. The following parameters can be defined in the Clusterpoint Server Web server module configuration file:

·         user name

·         user password

For more information on the Clusterpoint Server Web server module configuration file, see the Clusterpoint Server User Guide.

4.2.                    Clusterpoint XML Message Envelope

This section describes the common parts of the XML request and reply for all Clusterpoint API commands.

4.2.1.1.      XML Request

<?xml version="1.0" encoding="REQUEST-ENCODING"?>

<cpse:request xmlns:cpse="www.clusterpoint.com">

      <cpse:storage>storage name*</cpse:storage>

      <cpse:command>command name*</cpse:command>

      <cpse:timestamp>message creation date and time</cpse:timestamp>

      <cpse:requestid>message number</cpse:requestid>

      <cpse:application>creator of message</cpse:application>

      <cpse:user>user name*</cpse:user>

      <cpse:password>user password*</cpse:password>

      <cpse:timeout> function timeout period </cpse:timeout>

      <cpse:reply_charset>reply encoding</cpse:reply_charset>

      <cpse:content>command specific data            </cpse:content>

</cpse:request>
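
As an illustration, the request envelope can be assembled with Python's standard xml.etree.ElementTree module. This is a sketch under assumptions: the helper name build_request and the sample values are hypothetical, and only the mandatory elements plus an empty content element are produced.

```python
import xml.etree.ElementTree as ET

CPSE_NS = "www.clusterpoint.com"          # namespace URI as written in the envelope
ET.register_namespace("cpse", CPSE_NS)    # serialize with the cpse: prefix

def build_request(storage, command, user, password, content=None, encoding="utf-8"):
    """Build the common request envelope and return it as encoded bytes."""
    root = ET.Element(f"{{{CPSE_NS}}}request")
    for tag, value in (("storage", storage), ("command", command),
                       ("user", user), ("password", password)):
        ET.SubElement(root, f"{{{CPSE_NS}}}{tag}").text = value
    content_el = ET.SubElement(root, f"{{{CPSE_NS}}}content")
    if content is not None:
        content_el.append(content)        # command specific data, if any
    return ET.tostring(root, encoding=encoding, xml_declaration=True)

request_bytes = build_request("test", "status", "demo", "secret")
print(request_bytes.decode("utf-8"))
```

Building the message from element objects, rather than by string concatenation, means that special characters in values are escaped automatically.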

4.2.1.2.      XML Reply

<?xml version="1.0" encoding="REPLY-ENCODING"?>

<cpse:reply xmlns:cpse="www.clusterpoint.com">

      <cpse:storage>storage name</cpse:storage>

      <cpse:timestamp>reply creation date and time</cpse:timestamp>

      <cpse:content>command specific data</cpse:content>

      <cpse:command>command name for which the reply is created</cpse:command>

      <cpse:requestid>message number for which the reply is created</cpse:requestid>

      <cpse:seconds>time period for the reply creation</cpse:seconds>

      <cpse:replyid>unique message id created by the Clusterpoint Server</cpse:replyid>

</cpse:reply>
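
A client might read the common reply fields as in the following Python sketch. The sample reply and its values are invented for illustration, and the helper name parse_envelope is hypothetical. Elements that are absent from a reply are returned as None.

```python
import xml.etree.ElementTree as ET

NS = {"cpse": "www.clusterpoint.com"}

# Invented sample reply, shaped like the envelope above.
SAMPLE_REPLY = """<cpse:reply xmlns:cpse="www.clusterpoint.com">
  <cpse:storage>test</cpse:storage>
  <cpse:command>status</cpse:command>
  <cpse:requestid>42</cpse:requestid>
  <cpse:seconds>0.003</cpse:seconds>
  <cpse:content/>
</cpse:reply>"""

def parse_envelope(reply_xml):
    """Return the envelope fields as a dict, plus the content element."""
    root = ET.fromstring(reply_xml)
    names = ("storage", "timestamp", "command", "requestid", "seconds", "replyid")
    fields = {name: root.findtext(f"cpse:{name}", default=None, namespaces=NS)
              for name in names}
    return fields, root.find("cpse:content", NS)

fields, content = parse_envelope(SAMPLE_REPLY)
print(fields)
```

Matching requestid in the reply against the value sent in the request lets an application pair replies with requests.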

4.3.                    DATA MANIPULATION

This section describes the following data manipulation commands:

·         Insert, Update, and Replace

·         Delete

·         Index

·         Clear

 

4.3.1.  API commands - INSERT, UPDATE, REPLACE

The insert command adds a document to the Clusterpoint storage. If a document with such ID already exists, the command returns an error.

If a document with such ID exists in the Clusterpoint storage, the update command updates the document. If a document with such ID is not in the Clusterpoint storage, the update command adds it to the Clusterpoint storage.

The replace command replaces contents of a document in the Clusterpoint storage. If a document with such ID is not in the Clusterpoint storage, the command returns an error.
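
The difference between the three commands lies in how they treat an existing or missing document ID. The following toy Python model mimics only those ID rules. It is not the server's implementation, the class name ToyStorage is hypothetical, and it does not model how update changes the contents of an existing document.

```python
class ToyStorage:
    """In-memory stand-in that mimics only the documented ID rules."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc_id, doc):
        if doc_id in self.docs:
            raise KeyError(f"document {doc_id!r} already exists")
        self.docs[doc_id] = doc

    def update(self, doc_id, doc):
        # Works whether or not the ID exists.
        self.docs[doc_id] = doc

    def replace(self, doc_id, doc):
        if doc_id not in self.docs:
            raise KeyError(f"document {doc_id!r} does not exist")
        self.docs[doc_id] = doc

storage = ToyStorage()
storage.insert("1", {"title": "Doc1"})
storage.update("2", {"title": "Doc2"})       # ID 2 is new: the document is added
storage.replace("1", {"title": "Doc1 v2"})   # ID 1 exists: contents are replaced

outcomes = []
for action, doc_id in ((storage.insert, "1"), (storage.replace, "3")):
    try:
        action(doc_id, {"title": "x"})
        outcomes.append("ok")
    except KeyError:
        outcomes.append("error")
print(outcomes)
```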

4.3.1.1.      XML Request

<cpse:content>

      <document>document content</document>

</cpse:content>

The document content consists of document structure elements. The default Clusterpoint Server document structure is as follows:

<document>

      <id> document id * </id>

      <title> document title </title>

      <rate> document rate </rate>

      <domain> document domain  </domain>

      <info> meta data </info>

      <text> textual information, which is used for indexing </text>

      <hidden> textual information, which is used for indexing, but which is not shown</hidden>

</document>

For more information on the default Clusterpoint Server document structure, see Creating Document Structure with Application.

4.3.1.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc1

http://host/cgi-bin/cpse/cpse-gw.cgi?command=update&storage=test&id=1&title=Doc1

http://host/cgi-bin/cpse/cpse-gw.cgi?command=replace&storage=test&id=1&title=Doc1

4.3.1.3.      XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.1.4.      Binary Files Conversion

Binary file conversion is an integrated feature of the insert, update, and replace commands. It converts binary file contents into plain text. Thus, you can add Microsoft Office files and other binary files to the Clusterpoint storage and perform full text search on them.

The following table lists extensions of binary files that can be added to the Clusterpoint storage:

Extension    Description
DOC          Microsoft Word document.
XLS          Microsoft Excel document.
PPT          Microsoft PowerPoint document.
RTF          Rich text format document.
PDF          Adobe portable document format document.
PS           PostScript document.

To use the binary files conversion functionality, in the XML request, in the place of the text tag, use the file tag in the following format:

      <!-- If store="yes", the original document is stored in the Clusterpoint storage and returned when retrieved. The default value is "no". -->

      <file store="yes/no">

                  <ext> extension of binary file </ext>

                  <data> data of binary file converted to the base64 encoding </data>

      </file>

As described in the data tag, binary file contents first must be converted to the base64 encoding. This is because XML does not support storing binary data within a tag.
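
The base64 step can be sketched in Python as follows. The helper name build_file_element and the sample bytes are hypothetical. In a real application the bytes would be read from a file, and whether the extension must be upper or lower case is not stated in this guide.

```python
import base64
import xml.etree.ElementTree as ET

def build_file_element(raw_bytes, extension, store=False):
    """Wrap binary file contents in a <file> element, base64-encoded."""
    file_el = ET.Element("file", {"store": "yes" if store else "no"})
    ET.SubElement(file_el, "ext").text = extension
    ET.SubElement(file_el, "data").text = base64.b64encode(raw_bytes).decode("ascii")
    return file_el

sample_bytes = b"%PDF-1.4 sample bytes"   # stand-in for real file contents
file_el = build_file_element(sample_bytes, "PDF", store=True)
print(ET.tostring(file_el, encoding="unicode"))
```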

4.3.2.  API command - DELETE

The delete command deletes a document from the Clusterpoint storage. If a document with such ID is not in the Clusterpoint storage, the command returns an error.

4.3.2.1.      XML Request

<cpse:content>

      <document>

                  <id>document id *</id>

      </document>

</cpse:content>

4.3.2.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=delete&storage=test&id=1

4.3.2.3.      XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.3.  API command - INDEX

After inserting, updating, replacing, or deleting documents in the Clusterpoint storage, the Clusterpoint Server must permanently save the changes to the Clusterpoint Index. The Clusterpoint Server can decide on its own when to start saving the changes to the Clusterpoint Index. However, to optimize performance for large data amounts, it is recommended to inform the system when a batch of documents has been loaded and no more documents will be loaded in the near future, so that the Clusterpoint Server can allocate all resources to indexing.

The index command tells the Clusterpoint Server to start the process of indexing.

4.3.3.1.      XML Request

The <cpse:content> element does not contain any command specific data.

4.3.3.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=index

4.3.3.3.      XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.3.4.  API command - CLEAR

The clear command deletes all documents from the Clusterpoint storage. This command should be used only when a complete re-indexing of the Clusterpoint storage is necessary.

4.3.4.1.      XML Request

The <cpse:content> element does not contain any command specific data.

4.3.4.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=clear

4.3.4.3.      XML Reply

If the command is executed successfully, the XML reply does not contain any command specific data.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.4.                    STATUS MONITORING

This section describes the Status command.

4.4.1.  API command - STATUS

The status command returns status information of the Clusterpoint Server instance. The status information includes:

·         number of documents in the Clusterpoint storage

·         number of words in the vocabulary

·         total number of words in the Clusterpoint storage

·         number of executed commands since the last startup of the instance

·         number of errors that have occurred since the last startup of the instance

4.4.1.1.      XML Request

The <cpse:content> element does not contain any command specific data.

4.4.1.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=status

4.4.1.3.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <status>

                  <mgr>

                              <started> date and time, when the Clusterpoint Server was started </started>

                              <age>time period that the Clusterpoint Server has been running since it was started</age>

                              <total_time_elapsed>total time spent by the Clusterpoint Server executing commands</total_time_elapsed>

                              <transactions><!-- This element contains information about executed commands. -->

                                          <total> total number of commands executed</total>

                                          <successful>number of commands that were successfully executed</successful>

                                          <failed> number of commands that were executed unsuccessfully </failed>

                                          <requests command="command name">number of times the command was executed</requests> <!-- This element is repeated for every command that was executed. -->

                              </transactions>

                              <last_modified> date and time, when modifications in Clusterpoint storage occurred last time </last_modified>

                              <queue> number of commands executed simultaneously </queue>

                              <version> Clusterpoint Server version number</version>

                  </mgr>

                  <idx> <!-- This element contains information about the Clusterpoint Index. -->

                              <journal>

                                          <usage> indexing memory cache usage in percent</usage>

                              </journal>

                              <pool_state> index state: normal, expanding, or collapsing</pool_state>

                  </idx>

 

                  <voc> <!-- This element contains information about the vocabulary. -->

                              <unique_words>unique words in the Clusterpoint storage</unique_words>

                              <total_words>total number of all words</total_words>

                  </voc>

                  <dat>

                              <documents>total number of documents</documents>

                              <domains> number of distinct domains of documents</domains>

                  </dat>

      </status>

</cpse:content>

When importing data to the Clusterpoint storage:

·         If the memory reserved for memory cache is enough for the data amount being imported, the index state is normal.

·         If the memory reserved for memory cache is not enough for the data amount being imported, the index state is one of the following:

Title         Description
expanding     The data being imported are written to another cache, which is written to the disk.
collapsing    When the importing is complete, the Clusterpoint Server is committing data written on the disk to the Clusterpoint storage.

Note:    While the index state is expanding or collapsing, the data written to the disk are not available for FTS. Only when data are added to the Clusterpoint Index, they are available for FTS.
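
A monitoring script might extract a few of these values as in the Python sketch below. The sample content and all of its numbers are invented, and the helper name summarize_status is hypothetical. The all_data_searchable flag simply applies the note above: data are fully available for FTS only when the index state is normal.

```python
import xml.etree.ElementTree as ET

# Invented sample, shaped like the reply content described above.
SAMPLE_CONTENT = """<cpse:content xmlns:cpse="www.clusterpoint.com">
  <status>
    <mgr>
      <transactions>
        <total>120</total><successful>118</successful><failed>2</failed>
      </transactions>
    </mgr>
    <idx><journal><usage>35</usage></journal><pool_state>normal</pool_state></idx>
    <voc><unique_words>5000</unique_words><total_words>90000</total_words></voc>
    <dat><documents>300</documents><domains>7</domains></dat>
  </status>
</cpse:content>"""

def summarize_status(content_xml):
    """Pick a few monitoring values out of the status reply content."""
    status = ET.fromstring(content_xml).find("status")
    pool_state = status.findtext("idx/pool_state").strip()
    return {
        "documents": int(status.findtext("dat/documents")),
        "failed": int(status.findtext("mgr/transactions/failed")),
        "pool_state": pool_state,
        "all_data_searchable": pool_state == "normal",
    }

summary = summarize_status(SAMPLE_CONTENT)
print(summary)
```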

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5.                    DATA RETRIEVAL

This section describes the following data retrieval commands:

·         Lookup and Retrieve

·         Search

·         Select

·         Similar

·         Alternatives

·         List last

4.5.1.  API commands - LOOKUP and RETRIEVE

The lookup command searches for a document in the Clusterpoint storage and returns information on whether a document with such ID exists in the Clusterpoint storage.

The retrieve command returns a document from the Clusterpoint storage. If a document with such ID is not in the Clusterpoint storage, the command returns an error.

4.5.1.1.      XML Request

<cpse:content>

      <document>

                  <id>document id *</id>

      </document>

</cpse:content>

4.5.1.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=lookup&storage=test&id=1

http://host/cgi-bin/cpse/cpse-gw.cgi?command=retrieve&storage=test&id=1

4.5.1.3.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <found>indicator 1 or 0 if a document is found or not, respectively</found>

      <results>

                  <document>

                              meta data for the lookup command

                              textual information for the retrieve command

                  </document>

      </results>

</cpse:content>

Meta data for the lookup command is information included in tags, for which the policy list is set to YES. By default, these are id, title, and rate tags.

For more information on policies, see Importing XML Data With Custom Structure.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5.2.  API command - SEARCH

The search command performs FTS in the Clusterpoint storage.

4.5.2.1.      XML Request

<cpse:content>

      <query> search query *</query>

      <docs> number of documents in the result set </docs>

      <offset> offset from the beginning of the result set</offset>

      <case_sensitive> Boolean type parameter: YES to enable case sensitivity of the first letter of words when performing the search, NO not to enable case sensitivity </case_sensitive>

      <relevance> Boolean type parameter: YES to order results by relevance, NO not to order results by relevance </relevance>

      <snippet> Boolean type parameter: yes (default) to display snippets, no not to display them </snippet>

      <highlight> Boolean type parameter: yes (default) to highlight text matching the search query, no not to highlight it </highlight>

      <group_size> Maximum number of documents per result group. Results from one domain are grouped together within one result page. If the parameter is not set, the default value is 0, which implies that no grouping by domains is performed and no limit is set. </group_size>

      <rate_from> searching documents within a rate range: the FROM value </rate_from>

      <rate_to> searching documents within a rate range: the TO value </rate_to>

      <wildcards> <!-- This element contains parameters for configuring wildcard patterns support. -->

                  <allow> Information whether the wildcard patterns search is enabled. Values “yes” or “no”.</allow>

                  <cover_factor> When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for, to ensure higher performance. This element defines that limit as a percentage of the total number of occurrences, in the Clusterpoint storage, of all words that match the wildcard pattern.</cover_factor>

                  <min_expand> The minimum size, in absolute numbers, of the set of words from the Clusterpoint storage vocabulary that match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if only 2 words fall within the cover_factor, but min_expand is 4, then 4 words are used in the search.</min_expand>

                  <max_expand> The maximum size, in absolute numbers, of the set of words from the Clusterpoint storage vocabulary that match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if 20 words fall within the cover_factor, but max_expand is 16, then only 16 words are used in the search.</max_expand>

      </wildcards>

</cpse:content>

If values for the wildcards tag are not defined, corresponding parameters set in the Clusterpoint storage configuration file are used.

For more information on configuring Clusterpoint storage, see the Clusterpoint Server User Guide.
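The request body described above can be assembled programmatically. The following is a minimal Python sketch, not part of the Clusterpoint API: the function name build_search_content is hypothetical, the wrapper element is written as a plain content element (in a real request it is the <cpse:content> element), and only plain-text queries are handled, because the text of each element is XML-escaped.

```python
import xml.etree.ElementTree as ET


def build_search_content(query, docs=10, offset=0, **options):
    """Return the body of a search request as an XML string.

    Hypothetical helper: the element names come from the XML request
    description, everything else is an illustration only.
    """
    content = ET.Element("content")  # stands in for <cpse:content>
    ET.SubElement(content, "query").text = query
    ET.SubElement(content, "docs").text = str(docs)
    ET.SubElement(content, "offset").text = str(offset)
    # Any further parameters (case_sensitive, relevance, ...) are added as-is.
    for name, value in options.items():
        ET.SubElement(content, name).text = str(value)
    return ET.tostring(content, encoding="unicode")


xml_body = build_search_content("George Brown", docs=5, case_sensitive="YES")
print(xml_body)
```

Parameters that are left out are simply not sent, so the defaults described above apply to them.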

This section contains the following topics:

·         Search Query Syntax

·         Case Sensitivity for Proper Names

·         Grouping Results by Domain

·         Filtering Results by Rate

·        Web Friendly Result Navigation

 

4.5.2.1.1.                Search Query Syntax

Clusterpoint Server provides several mechanisms for specifying your search query. Each mechanism has a definite syntax, which is described in the following subsections. For a better understanding, each subsection also contains an example of the mechanism described and an explanation about what the example search query returns.

This section contains the following topics:

·         Single Search Term

·         AND

·         Phrase Search

·         OR

·         NOT

·         Boolean Expressions

·         Wildcard Patterns

·         Ignored Words

·         Stemming

·         Search within Markup

·         Proximity Search

·         Numeric Search

4.5.2.1.1.1.              Single Search Term

To search for documents that contain a single search term, the search term must be entered as is.

Example:

George returns documents that contain the word “George”.

4.5.2.1.1.2.              AND

To search for documents that contain all of several search terms, which are not necessarily next to each other, the search terms must be separated by the space character.

Example:

George Brown returns documents that contain the word “George” and the word “Brown”.

4.5.2.1.1.3.              Phrase Search

To search for documents that contain an exact phrase, the search phrase must be enclosed in quotation marks.

Example:

“George Brown” returns documents that contain the exact phrase “George Brown”.

4.5.2.1.1.4.              OR

To search for documents that contain any of the search terms, the search terms must be enclosed in { } and separated with the space character.

Example:

{George Brown} returns documents that contain either the word “George” or the word “Brown”.

4.5.2.1.1.5.              NOT

To search for documents that do not contain the search term, the search term must be preceded with ~.

Example:

~George returns documents that do not contain the word “George”.

4.5.2.1.1.6.              Boolean Expressions

AND, OR, and NOT logical connectives can be combined in more complex search expressions using the brackets ( ), which allows you to build any Boolean expression.

Example:

{(George Brown) (Mary Green)} returns documents that contain either the word “George” and the word “Brown”, or the word “Mary” and the word “Green”.

{(A B ~C) ”D E”} is parsed in the expression tree as follows:

Figure 20: Search query expression tree
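Composing such expressions by hand is error-prone once they are nested. The following minimal Python sketch builds the query string from the syntax rules described in this section; the helper names (all_of, any_of, not_term, phrase) are hypothetical and exist only for illustration.

```python
def all_of(*terms):
    """AND: terms separated by spaces and grouped with ( )."""
    return "(" + " ".join(terms) + ")"


def any_of(*terms):
    """OR: terms enclosed in { } and separated by spaces."""
    return "{" + " ".join(terms) + "}"


def not_term(term):
    """NOT: the term is preceded by ~."""
    return "~" + term


def phrase(text):
    """Exact phrase: the text is enclosed in quotation marks."""
    return '"' + text + '"'


# Rebuild the expression that is shown parsed in the expression tree.
query = any_of(all_of("A", "B", not_term("C")), phrase("D E"))
print(query)  # {(A B ~C) "D E"}
```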

4.5.2.1.1.7.              Wildcard Patterns

To search for documents that contain a class of words, represent:

·                                 exactly one unknown character using the question mark ?

·                                 one or more unknown characters using the asterisk *

·                                 range of definite characters for one unknown character occurrence using the square brackets [ ]

Note:          When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for. This limitation is introduced to preserve the high performance of the Clusterpoint Server. However, the maximum number of the words being searched can be increased or decreased, when configuring the Clusterpoint Server. For more information on configuring the Clusterpoint Server, see the Clusterpoint Server User Guide.

Example:

ma? returns documents that contain the word “map”, “may”, “man”, “mad”, and so on.

Geo* returns documents that contain the word “George”, “Geothermal”, “Geology”, and so on.

ma[py] returns documents that contain only the word “map” or “may”.

c?[au]* returns documents that contain the word “counter”, “club”, “chapter”, “country”, “change”, “chat”, “council”, “class”, “cpu”, “challenge”, “church”, “couple”, “championship”, and so on.
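To check which words a pattern covers before sending it, the three wildcard rules can be mirrored with a regular expression. This is a minimal Python sketch of the rules as they are stated in this section, not the server's implementation: the function names are hypothetical, square brackets are treated as a plain set of characters, and matching is case-sensitive here.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into an anchored regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "?":
            parts.append(".")    # exactly one unknown character
        elif ch == "*":
            parts.append(".+")   # one or more unknown characters
        elif ch == "[":
            end = pattern.index("]", i)
            # Set of allowed characters for one position.
            parts.append("[" + re.escape(pattern[i + 1:end]) + "]")
            i = end
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))


def matches(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


print(matches("ma[py]", "may"))      # True
print(matches("ma?", "maple"))       # False: ? stands for one character only
print(matches("c?[au]*", "chapter")) # True
```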

4.5.2.1.1.8.              Ignored Words

By default, Clusterpoint Server indexes all words and characters, including common words such as “and”, “where”, and “how”, as well as certain single characters and single letters.

Unfortunately, these common words and characters tend to slow down the search without improving the search results. They are called ignored words.

The Clusterpoint Server can be configured to ignore the common words described above. The words that appear most often in search queries to the Clusterpoint storage are detected from the customer-supplied ignored words list, and the limit of the ignored words list can be edited.

If a common word or a character is essential to getting the search results you want, you can include it by preceding it with a plus sign +.

Example:

George +and Mary returns documents that contain all three words: “George”, “and”, and “Mary”.

4.5.2.1.1.9.              Stemming

It is possible to include a word and its inflected forms in one search request, for example, “go” and “going”.

This feature is especially useful for so-called synthetic languages, such as German, Russian, and Latin, in which syntactic relations within sentences are expressed by changes in the form of a word that indicate distinctions of tense, person, gender, number, mood, voice, and case.

To search for documents that contain inflected forms of a word, the word or phrase must be enclosed in dollar signs $ $.

Example:

$George$ returns documents that contain the word “George” and “George’s”.

4.5.2.1.1.10.          Search within Markup

To search for documents that contain the search term in a specific tag, the search term must be enclosed in the appropriate tags.

Note:          Searching within markup can be performed only if the document policy rule property index is used with the value xml or all. For the default document structure, the index policy rule is set to xml by default. For more information on defining storage policy, see Importing XML Data With Custom Structure.

Example:

<person>George Brown</person> returns documents that contain the word “George” in the <person> tag and the word “Brown” in the <person> tag.

{<person>George</person> <address>”Great Britain”</address>} returns documents that contain either the word “George” in the <person> tag or the phrase “Great Britain” in the <address> tag.

4.5.2.1.1.11.          Proximity Search

It is possible to define the maximum number of words that may appear between certain search terms. These search terms are also defined in the search query. This feature is called proximity search.

To use the proximity search feature, the search term must be as follows:

@ N term1 term2 @,

where N is the maximum count of words between the search terms, and term1 and term2 are search terms. There can be any number of search terms included in the proximity search.

If N is 1, then the search is exactly the same as if the phrase search was used.

For more information on the phrase search, see Phrase Search.

Example:

@ 4 phone fax @ returns documents that contain the words “phone” and “fax” not further than 4 words from each other.

4.5.2.1.1.12.          Numeric Search

Because the Clusterpoint Server indexes numeric information as well as text, it is possible to perform a numeric search. Numeric search finds documents that contain numeric values within a numeric interval.

For example, suppose that each document contains information about an object, including its geographic coordinates. In that case, a numeric search can retrieve all objects within a definite range of geographic coordinates. Thus, Clusterpoint Server can be used in online maps, where people look for information on different objects in a definite area.

The numeric search can be performed only together with a text search.

Numeric values in documents are indexed and stored as floating-point numbers, regardless of whether they are integers or floating-point numbers in the original documents.

The fractional part is stored to a precision of six digits.

To use the numeric search functionality, the search term must be as follows:

·                                             To perform a numeric search within a range of two numeric values, enter _textual search term_X .. Y, where X is the minimum of the numeric values searched for, and Y is the maximum.

·                                             To perform a numeric search for documents that contain a numeric value greater than the given value, enter _textual search term_>X.

·                                             To perform a numeric search for documents that contain a numeric value smaller than the given value, enter _textual search term_<X.

It does not matter whether the textual search term is entered before or after the numeric search term.

Example:

Document content:

<document>

      <id>76541</id>

      <title>George’s profile</title>

      <text>

            <name>George Brown</name>

            <age>26</age>

      </text>

</document>

Search query that matches the document:

<query>

      <name>George</name> <age>20 .. 30</age>

</query>

<numeric_ordering>center</numeric_ordering>

Note:      When performing a numeric search in one document tag, such as the <age> tag in the previous example, only one numeric interval can be entered. If you enter more than one numeric interval for one tag, nothing is returned, because numeric intervals are joined with the AND logical operation.

For information on performing numeric search in more than one tag, see Numeric Search in More Than One Tag.

The <numeric_ordering> tag in the example denotes the order in which search results must be returned.

Possible values for numeric ordering are the following:

Value           Description

none            No numeric ordering is applied.

center          Results that are closer to the mean value of the numeric search interval are returned first. This value is allowed only for numeric search within a range of two numeric values.

ascending       Numeric search results are returned in ascending order.

descending      Numeric search results are returned in descending order.
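The effect of the four ordering values on a single tag can be illustrated with a short Python sketch. It only mirrors the descriptions in the table above; the function name order_results and the sample ages are hypothetical, and the server's handling of ties is not specified here.

```python
def order_results(values, ordering="none", interval=None):
    """Order numeric values the way the table above describes."""
    if ordering == "none":
        return list(values)
    if ordering == "ascending":
        return sorted(values)
    if ordering == "descending":
        return sorted(values, reverse=True)
    if ordering == "center":
        if interval is None:
            raise ValueError("center ordering needs a range of two values")
        low, high = interval
        mean = (low + high) / 2
        # Values closest to the mean of the interval come first.
        return sorted(values, key=lambda value: abs(value - mean))
    raise ValueError("unknown ordering: " + ordering)


ages = [28, 21, 26, 23]  # hypothetical <age> values that matched 20 .. 30
print(order_results(ages, "center", interval=(20, 30)))  # [26, 23, 28, 21]
```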

4.5.2.1.1.13.          Numeric Search in More Than One Tag

It is possible to perform numeric search in more than one tag. This means that a numeric search range can be specified for each tag that contains numeric information.

Example:

Document content:

<document>

      <id>76541</id>

      <title>George’s profile</title>

      <text>

                  <name>George Brown</name>

                  <age>26</age>

                  <children>4</children>

      </text>

</document>

Search query that matches the document:

<query>

<name>George</name> <age>20 .. 30</age> <children>&gt; 3</children>

</query>

<numeric_ordering>center</numeric_ordering>

Numeric search in more than one tag is especially useful and necessary for geographic coordinate searching, where it is necessary to search for an object by its longitude and latitude.

Note that the '>' symbol in the example above must be written as '&gt;', in accordance with XML escaping rules, to avoid XML parsing errors.
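When the query is generated by an application, the escaping can be left to a library. The following minimal Python sketch uses the standard xml.sax.saxutils.escape function; the <children> and <age> tags are taken from the examples in this section, and the values are illustrative.

```python
from xml.sax.saxutils import escape

# escape() replaces &, < and > with their XML entities.
greater = escape("> 3")
smaller = escape("< 18")

print("<children>" + greater + "</children>")  # <children>&gt; 3</children>
print("<age>" + smaller + "</age>")            # <age>&lt; 18</age>
```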

For numeric search in more than one tag, result ordering is combined into a single ordering across all tags.

The following table describes how result ordering is combined:

Ordering type   Description

ascending       Results are ordered ascending by the sum of all numeric values from tags in which the numeric search is performed.

descending      Results are ordered descending by the sum of all numeric values from tags in which the numeric search is performed.

center          Results are ordered by shortest distance to the center of intervals in multi-dimensional space, where each dimension represents a tag in which the numeric search is performed.

The distance to the center of intervals in multi-dimensional space is calculated by the following formula:

(x-xc)/xr*(x-xc)/xr + (y-yc)/yr*(y-yc)/yr + … + (z-zc)/zr*(z-zc)/zr,

where

x, y, z are the numeric values found in the tags in which the numeric search is performed

xc, yc, zc are the centers of each interval, respectively

xr, yr, zr are half of each numeric interval range, respectively.
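The formula can be written out as a short Python sketch. The function name center_distance and the sample intervals are hypothetical; the calculation itself follows the formula above term by term, with each tag contributing its offset from the interval center divided by half of the interval range, squared.

```python
def center_distance(values, intervals):
    """Sum of squared, range-normalized offsets from the interval centers.

    values    - one numeric value per tag, for example (26, 4)
    intervals - one (minimum, maximum) pair per tag, in the same order
    """
    total = 0.0
    for value, (low, high) in zip(values, intervals):
        center = (low + high) / 2      # xc, yc, zc
        half_range = (high - low) / 2  # xr, yr, zr
        total += ((value - center) / half_range) ** 2
    return total


# Hypothetical search: <age>20 .. 30</age> and <children>2 .. 6</children>
distance = center_distance((26, 4), [(20, 30), (2, 6)])
print(distance)
```

A document with a smaller result is returned earlier when the center ordering is used.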

Numeric search in several tags, that is, in several dimensions, has an additional feature that allows returning numeric search results that match:

·                                             a hypercube of all numeric intervals, which is default, or

·                                             only a hypersphere of all numeric intervals.

For example, if geographic coordinates of ATMs in a city are indexed, it is possible to search for an ATM that is not farther than 1 kilometer from a definite location. That is, you need to retrieve only those ATMs that match the circle (a hypersphere with 2 dimensions in this case) with a radius of 1 kilometer.

If the default numeric search is performed in the previous example, results that match a square with a side length of 2 kilometers are returned. This means that ATMs in the corners of the square, up to the square root of 2 (approximately 1.41) kilometers from the location, are also returned.

As stated above, the default value for the multidimensional shape feature is a hypercube. The value is defined in the <md_shape> tag, which is included in the Clusterpoint Server command syntax.

Possible values for the <md_shape> tag are the following:

Value           Description

cube            Results that match a hypercube are returned.

sphere          Results that match a hypersphere are returned.

Example:

Document content:

<document>

      <id>76543</id>

      <title>Gas station</title>

      <text>

                  <name>Springtown</name>

                  <x>3.2</x>

                  <y>5.7</y>

      </text>

</document>

Search query that matches the document and finds gas stations within the distance of 1 kilometer from point (4.0, 6.0):

<query>

<title>gas</title> <x>3.0 .. 5.0</x> <y>5.0 .. 7.0</y>

</query>

<numeric_ordering>center</numeric_ordering>

<md_shape>sphere</md_shape>
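The difference between the two shapes can be checked with a short Python sketch. It assumes, as an illustration only, that the hypersphere is the one inscribed in the hypercube of intervals, so that a point matches when its range-normalized distance from the center does not exceed 1; the function names and the second sample point are hypothetical.

```python
def in_cube(point, intervals):
    """True if every coordinate lies inside its numeric interval."""
    return all(low <= value <= high
               for value, (low, high) in zip(point, intervals))


def in_sphere(point, intervals):
    """True if the point lies inside the sphere inscribed in the cube."""
    total = 0.0
    for value, (low, high) in zip(point, intervals):
        center = (low + high) / 2
        half_range = (high - low) / 2
        total += ((value - center) / half_range) ** 2
    return total <= 1.0


intervals = [(3.0, 5.0), (5.0, 7.0)]  # <x>3.0 .. 5.0</x> <y>5.0 .. 7.0</y>
station = (3.2, 5.7)                  # the gas station from the example
corner = (3.1, 5.1)                   # hypothetical point near a corner

print(in_cube(station, intervals), in_sphere(station, intervals))  # True True
print(in_cube(corner, intervals), in_sphere(corner, intervals))    # True False
```

The point near the corner is returned with the cube shape but not with the sphere shape, which is the situation described for the ATM example.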

4.5.2.1.1.14.          Case Sensitivity for Proper Names

It is possible to perform case sensitive search for proper names, which means that case sensitivity is applied for the first letter of a search term.

The case sensitivity feature is switched on or off by setting the <case_sensitive> parameter in the search command’s XML request.

For more information on the search command’s XML request, see XML Request.

Example:

If the <case_sensitive> parameter is set to YES, and the search query contains “Bank”, then the search command returns documents in which the word “Bank” starts with a capital letter. Note that documents in which the word “BANK” is written in all capitals are also returned in this case, because case sensitivity is applied only to the first letter of a search term.
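The rule can be summarized in a short Python sketch. This is only an illustration of the behavior described above, not the server's matching code; the function name term_matches is hypothetical, and it is assumed that all letters after the first are compared without regard to case.

```python
def term_matches(term, word, case_sensitive=False):
    """Compare a search term with a word from a document."""
    if term.lower() != word.lower():
        return False
    if not case_sensitive:
        return True
    # Case sensitivity applies to the first letter only.
    return term[:1] == word[:1]


print(term_matches("Bank", "Bank", case_sensitive=True))  # True
print(term_matches("Bank", "BANK", case_sensitive=True))  # True
print(term_matches("Bank", "bank", case_sensitive=True))  # False
```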

 

4.5.2.1.1.15.          Grouping Results

It is possible to set the maximum number of documents in a search result that are returned from one group, as specified by a grouping tag. If this feature is used, documents that share one group tag value are grouped together within one result page.

You can define which XML tags are used for grouping by setting the ‘group=yes’ policy in the Document policy.

Grouping is enabled by setting the <group_size> parameter in the search command’s XML request to a value larger than 0.

If the parameter is not set, the default value is 0, which implies that no grouping by group tag is performed and no limit is set.

<cpse:content>

  ...

  <group> name of the tag for which groups were created </group>

  <group_size> number of documents returned from one group (default 0 - no grouping performed) </group_size>

  ...

</cpse:content>

For more information on the search command’s XML request, see XML Request.

 

4.5.2.1.1.16.          Filtering Results by Rate

It is possible to filter search results by document rate by setting the minimum and maximum of the rate range within which the rate of a document must be to appear in the search result.

Document rate is of the integer type. However, any date and time can be converted into an integer using the UNIX timestamp, which is the number of seconds elapsed from 01/01/1970 until the given date and time. Thus, it is possible to set a date and time as the document rate and to search for documents within a certain time interval.
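A minimal Python sketch of this conversion is shown below. The function name to_rate is hypothetical, the date format is the one used in the <published> tags of the sample documents in this guide, and the dates are assumed to be in UTC; an application that stores local times needs to apply its own time zone instead.

```python
from datetime import datetime, timezone


def to_rate(text, fmt="%d/%m/%Y %H:%M:%S"):
    """Convert a date and time string into a UNIX timestamp (UTC assumed)."""
    moment = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


# Values that could be placed in <rate_from> and <rate_to>.
rate_from = to_rate("01/01/2011 00:00:00")
rate_to = to_rate("01/01/2011 13:45:33")
print(rate_from, rate_to)  # 1293840000 1293889533
```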

The filtering results by rate feature is defined by setting the <rate_from> and <rate_to> parameters in the search command’s XML request.

For more information on the search command’s XML request, see XML Request.

 

4.5.2.1.1.17.          Web Friendly Result Navigation

Clusterpoint Server is designed with Web and Intranet applications in mind. In many cases, paging is used to display results on the Web. Paging means that the search result records are divided into parts, where each part is displayed on its own page and contains a fixed number of records.

The Web friendly result navigation feature is defined by setting the <docs> and <offset> parameters in the search command’s XML request.

For more information on the search command’s XML request, see XML Request.

Example:

If the <docs> parameter is set to 10, and the <offset> parameter is set to 30, the search command returns results 30 through 39.
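For a paged Web interface, the two parameters can be derived from the page number. The following Python sketch assumes zero-based page numbers, which is a choice of this example and not something the server prescribes; the function name page_parameters is hypothetical.

```python
def page_parameters(page, docs_per_page=10):
    """Return the <docs> and <offset> values for a zero-based page number."""
    offset = page * docs_per_page
    return {
        "docs": docs_per_page,
        "offset": offset,
        "first": offset,                     # first result on the page
        "last": offset + docs_per_page - 1,  # last result on the page
    }


params = page_parameters(3)
print(params)  # {'docs': 10, 'offset': 30, 'first': 30, 'last': 39}
```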

4.5.2.1.1.18.          Faceted Search (XML drill-down)

Faceted search, sometimes also called XML drill-down, is a feature that returns search results together with additional information, called facets. Facets group documents into a hierarchical structure by category, theme, and so on, and include the number of search hits per category. They can then be used to narrow or expand the search within this categorized structure without re-entering the search query. Using this feature, you can build extra navigation interfaces for catalogues, directories, themes, taxonomies, and similar structures that are generated dynamically for each query.

 

Setting ‘index=facet’ policy rule

This Document policy rule must first be set for those tags, for example, ‘category’, for which facets should be generated.

<rule>

        <xpath>//document/category</xpath>

        <property>index=facet</property>

</rule>

<rule>

        <xpath>//document/author</xpath>

        <property>index=facet</property>

</rule>

 

The policy rule for this feature must be set before the data is indexed in order to use the faceted search (XML drill-down) functionality. Facets are created at the index level only, to enable fast access without slow sorting and counting per query.

Document import

Once you have set the Document policy that defines which tags are treated as facets at search time, you can store and index documents in the storage.

<?xml version="1.0" encoding="utf-8"?>

<cpse:request xmlns:cpse="www.clusterpoint.com">

  <cpse:storage>news</cpse:storage>

  <cpse:command>insert</cpse:command>

  <cpse:user>root</cpse:user>

  <cpse:password>password</cpse:password>

  <cpse:content>

    <document>

      <id>news_1</id> <!-- article id, primary key -->

      <title>Lorem ipsum dolor sit amet</title>

       <teaser>Nam neque metus, pulvinar a dapibus in, mattis et  neque. Duis  sollicitudin ultricies nisl, ut tristique justo gravida  eget.  </teaser>

      <body>Suspendisse euismod porta  suscipit. Donec non lorem ut sem varius  sollicitudin at non risus.  Curabitur nec tellus vitae nunc porttitor

ultrices quis vulputate  magna. Nulla feugiat, nisl ut tristique dapibus,  lacus sem pharetra  leo, eu vestibulum arcu diam vel ante. Suspendisse  dolor mauris,

pellentesque non mollis eu, accumsan quis nisi. Phasellus  nec nulla eget eros aliquet fringilla ac eget enim.</body>

      <published>01/01/2011 13:45:33</published>

        <category>News

          <subcategory>Business</subcategory>

        </category>

        <author>Alice B</author>

    </document>

    <document>

      <id>news_2</id> <!-- article id, primary key -->

      <title>Mauris at odio eget neque pellentesque</title>

      <teaser>Etiam lobortis, diam non tincidunt scelerisque, tellus augue.</teaser>

       <body>Lorem ipsum dolor sit amet, consectetur adipiscing  elit.  Morbi  consectetur tristique lectus, dictum vestibulum magna  mattis in.

Vestibulum tellus metus, interdum eget aliquam ac,  feugiat  dapibus  felis. Nulla molestie fringilla elit, ac faucibus leo  fringilla  in.</body>

      <published>28/01/2010 8:12:55</published>

        <category>Sports

          <subcategory>Tennis</subcategory>

        </category>

        <author>Jonathan C</author>

    </document>

  </cpse:content>

</cpse:request>

Note that subtags of category, such as ‘subcategory’, can also be included to return multi-level hierarchical faceted search results that can be used for navigation.

Facets generation per search query

Within the content tag of the ‘search’ request, supply a ‘facet’ tag that identifies the XPath location relative to the tag for which the index facet policy rule is defined. Note that this request generates all facets that match the actual search query, along with the number of hits found for each facet.

Example request (multi-level faceted search drill-down sample):

 

<?xml version="1.0" encoding="utf-8"?>

<cpse:request xmlns:cpse="www.clusterpoint.com">

  <cpse:storage>news</cpse:storage>

  <cpse:command>search</cpse:command>

  <cpse:user>root</cpse:user>

  <cpse:password>password</cpse:password>

  <cpse:content>

    <query>od*</query>

    <docs>10</docs>

    <offset>0</offset>

    <facet>category</facet> <!-- will retrieve all top categories -->

    <facet>category=News/subcategory</facet> <!-- will retrieve all subcategories for "News" category -->

    <facet>author</facet> <!--  will retrieve all authors -->

    <list>

      <id>yes</id>

      <teaser>highlight</teaser>

      <body>snippet</body>

    </list>

  </cpse:content>

</cpse:request>

 

 

Example response (single-level drill-down):

 

<cpse:content>

 

  <facet path="category">

    <term hits="25">News</term>

    <term hits="1">Comments</term>

    <term hits="7">Questions</term>

    <term hits="7">Replies</term>

  </facet>

</cpse:content>

 

 

For multi-level drill-down, simply pass the correct deeper XPath location. Be sure to add “=<selected value>” to each parent category, or you will receive invalid hits.

Example response (multi-level drill-down):

 

<cpse:content>

 

  <facet path="category=News/subcategory">

    <term hits="10">Sports</term>

    <term hits="1">Business</term>

    <term hits="3">Politics</term>

  </facet>

 

</cpse:content>

 

You can combine as many facets as you need per each search query, however, please note that too many different facets can negatively affect performance for very large databases.
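On the application side, the facets in a reply can be collected into a dictionary for building navigation links. The following Python sketch uses the standard xml.etree.ElementTree module on a fragment modelled on the single-level response shown earlier; the wrapper element is written as a plain content element to keep the fragment self-contained, and the function name read_facets is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the example response; not a complete server reply.
SAMPLE = """
<content>
  <facet path="category">
    <term hits="25">News</term>
    <term hits="1">Comments</term>
    <term hits="7">Questions</term>
    <term hits="7">Replies</term>
  </facet>
</content>
"""


def read_facets(xml_text):
    """Map each facet path to a dictionary of term -> number of hits."""
    root = ET.fromstring(xml_text.strip())
    facets = {}
    for facet in root.iter("facet"):
        facets[facet.get("path")] = {
            term.text.strip(): int(term.get("hits"))
            for term in facet.iter("term")
        }
    return facets


facets = read_facets(SAMPLE)
print(facets["category"]["News"])  # 25
```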

 

4.5.2.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=search&storage=test&query=George
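Queries that contain spaces or special characters must be URL-encoded before they are placed in the address. The following Python sketch builds the address with the standard urllib.parse.urlencode function. The gateway address is the placeholder from the example above, the function name search_url is hypothetical, and passing docs and offset as GET parameters is an assumption of this example.

```python
from urllib.parse import urlencode

GATEWAY = "http://host/cgi-bin/cpse/cpse-gw.cgi"  # placeholder host


def search_url(storage, query, **params):
    """Build the HTTP GET address for a search command."""
    fields = {"command": "search", "storage": storage, "query": query}
    fields.update(params)  # assumed: extra parameters are passed the same way
    return GATEWAY + "?" + urlencode(fields)


simple = search_url("test", "George")
paged = search_url("test", "George Brown", docs=10, offset=30)
print(simple)
print(paged)
```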

4.5.2.3.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <problems><!-- This element only appears if some search terms were ignored. -->

            limpat/term (limited pattern coverage) or patign/term (frequent pattern - ignored) or worign/term (frequent word - ignored)

            multiple ignored terms appear as problem/term problem/term problem/term

      </problems>

      <ignored> common words that are ignored when performing the search </ignored>

      <realquery> real query that was used to perform the search, including the words derived from wildcard usage and with ignored words dropped </realquery>

      <found> number of documents found </found>

      <hits> approximate total amount of results that match the search query </hits>

      <more> number that indicates how many more documents that match the search query are found, but are not returned to the result set yet, a precise number if in the form of =N, and an at least number if in the form of >N</more>

      <from> documents in the result set  within a numerical range: the FROM value </from>

      <to> documents in the result set  within a numerical range: the TO value </to>

      <results>

                  <document> meta data of the document found </document> <!-- This element is repeated for all documents found. -->

      </results>

</cpse:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
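The <more> element needs a little care when it is read by an application, because it carries either an exact count or a lower bound. The following Python sketch interprets the two forms described above; the function name parse_more is hypothetical, and the input is the element text after XML parsing, so an escaped &gt; has already been turned into >.

```python
def parse_more(value):
    """Return (count, is_exact) for the text of a <more> element."""
    value = value.strip()
    if value.startswith("="):
        return int(value[1:]), True   # precise number of further documents
    if value.startswith(">"):
        return int(value[1:]), False  # at least this many further documents
    raise ValueError("unexpected <more> value: " + repr(value))


print(parse_more("=12"))    # (12, True)
print(parse_more(" >100"))  # (100, False)
```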

4.5.3.  API command - SELECT

The select command searches for documents by their identifiers. It is possible to select one document by a precisely entered document identifier or to use a wildcard pattern to select all documents whose identifiers match the pattern. For example, if only the asterisk * is entered, identifiers for all documents in the Clusterpoint storage are returned.

The default number of document identifiers returned to result set is 1024, but this number can be changed by entering a different number in the <docs> tag.

4.5.3.1.      XML Request

<cpse:content>

      <document>

                  <id>document id *</id>

      </document>

      <docs> number of document identifiers in the result set </docs>

      <offset> offset from the beginning of the result set </offset>

</cpse:content>

4.5.3.2.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <found> number of document identifiers matched </found>

      <from> document identifiers in the result set  within a numerical range: the FROM value </from>

      <to> document identifiers in the result set  within a numerical range: the TO value </to>

      <results>

                  <id> identifier of the document matched </id> <!-- This element is repeated for all document identifiers matched. -->

      </results>

</cpse:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

4.5.4.  API command - SIMILAR

The similar command searches the Clusterpoint storage for documents that are similar to given textual information, which is either supplied directly or contained in a document. The textual information for which similar documents are searched is also referred to as the input text.

The algorithm uses statistical information about the number of times that the words contained in the input text, the so-called keywords, appear in documents, and finds documents similar to the input text fragment or to the document with the given ID.

You must take into account that the algorithm uses statistical information about words and does not know their meaning. Therefore, similar documents might not be semantically alike. However, practice with large text collections that contain medium-large documents shows that the algorithm works well.

4.5.4.1.      XML Request

<cpse:content>

      <id> document id to which similar documents must be searched for ** </id>

      <text> textual information to which similar documents must be searched for ** </text>

      <len> number of keywords in the input text * </len>

      <quota> minimum number of keywords that must be found in a document for it to be returned in the search result * </quota>

      <docs> number of documents to be returned in the result set </docs>

      <offset> offset from the beginning of the result set </offset>

</cpse:content>

For large text collections in the Clusterpoint storage (> 1 million documents), best practice shows that the len element equal to 20 and the quota element equal to 4 gives the best results. However, you can experiment to find the best values for your specific text collection.

The two asterisks ** mean that exactly one of the two elements must be entered; in other words, the relationship between these two elements is XOR.

When developing Web applications, take into account the performance implications of leaving the len and quota parameters unrestricted. It is usually not recommended to allow end users to change the values set by the application; otherwise, some queries can take much longer than normal, even tens of seconds per query. For high-volume, high-usage Web sites, this search functionality should be used with care, and those parameters should always be fine-tuned for the specific data storage by the data storage owner.

4.5.4.2.      HTTP GET Parameters

Due to some internal limitations in API, one command cannot be used with XOR type parameters. So two commands are used. Command ‘similar-id’ is used, when search is performed against document ID and ‘similar-text’ – when searching for similar text. Command ‘similar’ is provided for backward compatibility only and acts as ‘similar-id’.

http://host/cgi-bin/cpse/cpse-gw.cgi?command=similar-id&storage=test&id=Doc1&len=20&quota=4

 

http://host/cgi-bin/cpse/cpse-gw.cgi?command=similar-text&storage=test&text=George&len=20&quota=4
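An application can enforce the XOR relationship and pick the matching command in one place. The following Python sketch is an illustration only: the function name similar_url is hypothetical, the gateway address is the placeholder used in the examples above, and the default values for len and quota are the ones suggested for large text collections.

```python
from urllib.parse import urlencode

GATEWAY = "http://host/cgi-bin/cpse/cpse-gw.cgi"  # placeholder host


def similar_url(storage, doc_id=None, text=None, length=20, quota=4):
    """Build the GET address for similar-id or similar-text."""
    # Exactly one of doc_id and text must be given (XOR).
    if (doc_id is None) == (text is None):
        raise ValueError("enter exactly one of doc_id and text")
    if doc_id is not None:
        fields = {"command": "similar-id", "storage": storage, "id": doc_id}
    else:
        fields = {"command": "similar-text", "storage": storage, "text": text}
    fields["len"] = length
    fields["quota"] = quota
    return GATEWAY + "?" + urlencode(fields)


by_id = similar_url("test", doc_id="Doc1")
by_text = similar_url("test", text="George")
print(by_id)
print(by_text)
```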

4.5.4.3.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <found> number of documents found </found>

      <hits> approximate total amount of results that match the search query </hits>

      <more> number that indicates, how many more documents that match the search query are found, but are not returned to the result set yet, a precise number if in the form of =N, and the minimum number if in the form of >N</more>

      <from> documents in the result set  within a numerical range: the FROM value </from>

      <to> documents in the result set  within a numerical range: the TO value </to>

      <results>

                  <document> meta data of the document found </document> <!-- This element is repeated for all documents found. -->

      </results>

</cpse:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.

 

4.5.5.  API command - ALTERNATIVES

If the alternatives search is performed, the system returns a set of alternative words from the Clusterpoint storage vocabulary, which are similar in spelling or has a different language declination, for example, if you enter ”bote”, then ”bite” and “byte” are offered for searching. Note that only actual words from the Clusterpoint storage are returned that has been indexed by Clusterpoint Server.  Words not present in any of the indexed documents, are not available in Clusterpoint Server Vocabulary and therefore can not be returned by this command.

This feature can be used for fuzzy searches and for spelling error corrections.

Alternative words are returned from the vocabulary, which ensures that they are actual words that have been imported into the Clusterpoint storage. When searching for alternative words, the alternatives command considers statistical information about how often the alternative word occurs in the vocabulary and how similar it is to the search term. In other words, alternatives that occur in the Clusterpoint storage more often and that are more similar to the search term are returned.

4.5.5.1.      XML Request

<cpse:content>

      <query> search query * </query>

      <cr> Minimum ratio between the occurrence of the alternative and the occurrence of the search term that is required for the alternative to be included. If you increase this parameter, fewer results are returned, but performance improves. </cr>

      <idif> Maximum number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference. If you increase this parameter, more results are returned, but performance is reduced. </idif>

      <h> Minimum number that gives an overall estimate of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value. If you increase this parameter, fewer results are returned, but performance improves. </h>

</cpse:content>

If values for the cr, idif, or h tags are not defined, corresponding parameters set in the Clusterpoint storage configuration file are used.

For more information on configuring Clusterpoint storage, see the Clusterpoint Server User Guide.
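The following sketch shows one way to build the child elements of the request content so that optional parameters are simply left out when they are not set. The function name alternatives_content is hypothetical, and the numeric values used in the test are invented; the sketch does not build the surrounding message envelope.

```python
from xml.sax.saxutils import escape


def alternatives_content(query, cr=None, idif=None, h=None):
    """Return the child elements of the request content as a string.

    Optional parameters left as None are omitted, so the values from the
    storage configuration file apply.
    """
    parts = ["<query>%s</query>" % escape(query)]
    for tag, value in (("cr", cr), ("idif", idif), ("h", h)):
        if value is not None:
            parts.append("<%s>%s</%s>" % (tag, value, tag))
    return "".join(parts)
```

Escaping the query text keeps characters such as & and < from breaking the XML message.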

4.5.5.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=alternatives&storage=test&query=George

4.5.5.3.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <alternatives_list>

                  <alternatives>

                              <to> alternative search term </to>

                              <count> number of times the alternative search term occurs in the Clusterpoint storage</count>

                              <word count="number of times the alternative occurs in the Clusterpoint storage" cr="ratio between the occurrence of the alternative and the occurrence of the search term" idif="number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference" h="number that gives an overall estimate of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value"> alternative </word><!-- This element is repeated for each alternative word. -->

                  </alternatives><!-- This element is repeated for each search term.-->

      </alternatives_list>

</cpse:content>

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
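For a spelling-correction feature, an application would typically pick the alternatives with the highest h value. The sketch below parses a reply fragment and does that. The sample fragment and all numbers in it are invented for illustration, the function name best_alternatives is hypothetical, and it is assumed that h is a plain number.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply fragment; every value here is invented.
SAMPLE = """
<alternatives_list>
  <alternatives>
    <word count="12" cr="4.0" idif="1" h="3.5">bite</word>
    <word count="30" cr="10.0" idif="1" h="8.2">byte</word>
  </alternatives>
</alternatives_list>
"""


def best_alternatives(xml_text, limit=3):
    """Return up to `limit` alternative words, highest h value first."""
    root = ET.fromstring(xml_text)
    words = [(float(w.get("h")), w.text.strip()) for w in root.iter("word")]
    words.sort(reverse=True)
    return [text for _, text in words[:limit]]
```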

 

4.5.6.  API command - LIST-LAST

The list-last command searches for the documents in the Clusterpoint storage that have most recently been inserted, updated, or replaced using the insert, update, or replace commands, respectively.

4.5.6.1.      XML Request

<cpse:content>

      <docs> number of documents in the result set </docs>

      <offset> offset (indent) from the beginning of the result set </offset>

</cpse:content>

4.5.6.2.      HTTP GET Parameters

http://host/cgi-bin/cpse/cpse-gw.cgi?command=list-last&storage=test&docs=10&offset=100

4.5.6.3.      XML Reply

If the command is executed successfully, the XML reply contains the following command specific data.

<cpse:content>

      <found>number of documents returned to the result set</found>

      <results>

                  <document> meta data for the list-last command</document><!-- This element is repeated for all documents found. -->

      </results>

</cpse:content>

Meta data for the list-last command is the information contained in tags for which the document policy property list is set to YES. By default, these are the id, title, and rate tags.

For more information on how to define document storage policy, see Importing XML Data With Custom Structure.

If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
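The docs and offset parameters lend themselves to simple paging. The sketch below assumes that offset counts documents from the start of the result set; the helper name list_last_url and the GATEWAY constant are invented for this example.

```python
from urllib.parse import urlencode

# Gateway address copied from the sample URL; "host" is a placeholder.
GATEWAY = "http://host/cgi-bin/cpse/cpse-gw.cgi"


def list_last_url(storage, page, per_page=10):
    """URL for one page of recently changed documents (pages start at 0)."""
    if page < 0 or per_page < 1:
        raise ValueError("page must be >= 0 and per_page >= 1")
    params = [("command", "list-last"), ("storage", storage),
              ("docs", per_page), ("offset", page * per_page)]
    return GATEWAY + "?" + urlencode(params)
```

With ten documents per page, page number 10 produces the same URL as the sample request above.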

 

4.6.                    CONTEXT TRIGGERS FOR ALERTING APPLICATIONS

Clusterpoint Server allows you to implement additional filtering functionality in your application for a specific storage, using a set of API commands for adding, removing, and activating context triggers.

Please note that Clusterpoint Server does not activate any context trigger event on its own. You should always program this functionality into your own application using the API command set described in this section.

Remember that you are responsible for checking for context matching events and for deciding when to activate the scripts that examine stored documents for context trigger matches.

A context trigger is defined as a search query that the server can run internally against the storage, together with an event script (usually a shell script) to be executed. A special API command, EXAMINE, must be used to actually check a document against the list of defined filters.

Because the trigger mechanism is controlled by the application, the application has more flexibility in alert handling. For example, it can avoid automatic triggering of events after data reindexing, which could otherwise generate a massive number of alert events, such as outgoing email messages.

Alerting API commands are sent to the server using standard Clusterpoint XML messaging.

4.6.1.  API command - ADD_TRIGGER

This command adds a context trigger with the supplied ID to the filtering list of all context triggers activated for that document storage.

Each document storage can have only one filtering list of context triggers.

The filtering list can contain hundreds or thousands of context triggers.

Each added context trigger matches documents against the query supplied in the filter tag. Your application should keep track of the trigger IDs, as they are the only reference to individual context triggers in the filter list of the storage. These IDs are returned when trigger events are executed for a specific document with the EXAMINE API command (see below).

<CPSSPC:command>add_trigger</CPSSPC:command>

<CPSSPC:content>

      <id>Trigger id</id>

      <filter>Trigger filter query</filter>

      <recipient>Recipient of notification</recipient>

</CPSSPC:content>

 

Please note that the storage document ID is not the same as the 'id' used in the trigger definition.

 

4.6.2.  API command - REMOVE_TRIGGER

This command removes a specific context trigger from the filtering list.

<CPSSPC:command>remove_trigger</CPSSPC:command>

<CPSSPC:content>

      <id>Trigger id</id>

</CPSSPC:content>

4.6.3.  API command - CLEAR_TRIGGERS

This command clears all context triggers for the specific storage, so that the filtering list becomes empty.

<CPSSPC:command>clear_triggers</CPSSPC:command>

4.6.4.  API command - EXAMINE

This command examines an existing document with a specific ID against all context triggers found in the filtering list. If the notify parameter is set to yes, a shell script is executed for each context trigger that matches the document.

The reply to this command also contains the list of IDs of the context triggers that matched the document.

The user application can process this list of context trigger IDs to build advanced content monitoring and alerting business applications.

 

<CPSSPC:command>examine</CPSSPC:command>

<CPSSPC:content>

      <document>

                  <id>document id to examine</id>

      </document>

      <notify>yes/no: whether to send a message</notify>

</CPSSPC:content>
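Because the application decides when documents are examined, it can also decide when the shell script should run. The sketch below builds the child elements of the examine content and switches notify off while a reindex is in progress. The function names and the reindexing flag are hypothetical, the message envelope is not built here, and it is assumed that matching trigger IDs are still listed in the reply when notify is set to no.

```python
from xml.sax.saxutils import escape


def examine_content(doc_id, notify):
    """Return the child elements of an examine request as a string."""
    flag = "yes" if notify else "no"
    return ("<document><id>%s</id></document><notify>%s</notify>"
            % (escape(str(doc_id)), flag))


def examine_batch(doc_ids, reindexing=False):
    """Build one examine body per document.

    During a reindex, notify is set to "no" so that no shell script is
    started for documents that were merely indexed again.
    """
    return [examine_content(d, notify=not reindexing) for d in doc_ids]
```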

4.6.5.  Filter script configuration parameters

The storage configuration must specify the shell script that is executed when a trigger matches a document, if this functionality is requested by the EXAMINE command.

This can be done using the Storage Configuration option of the Clusterpoint Manager tool, for example:

<config>

      <alerts>

                  <action>Shell script to execute</action>

      </alerts>

</config>

The shell script can perform any alert activity, such as sending an email message, writing to a log file, or updating a customer notification database.

 

 

 

 

4.7.                    ERROR HANDLING

If a command sent to the Clusterpoint Server is not executed successfully, an error is returned in the following XML reply message:

<?xml version="1.0" encoding="REPLY-ENCODING"?>

<cpse:reply xmlns:cps="www.clusterpoint.com">

  <cpse:storage> Storage name </cpse:storage>

  <cpse:timestamp> reply creation date and time </cpse:timestamp>

  <cpse:command> command name for which the reply is created </cpse:command>

  <cpse:requestid> message request id for which the reply is created </cpse:requestid>

  <cpse:content> command specific data </cpse:content>

  <cpse:seconds> time spent for the reply creation </cpse:seconds>

  <cpse:error>

    <code> error code (4 digits) </code>

    <text> error textual message </text>

    <level> error severity </level>

    <source> subsystem in which the error occurred </source>

    <document_id> document_id that the error refers to - for some errors </document_id>

  </cpse:error>

</cpse:reply>

The error severity can be one of the following:

 

Type            Description

DEBUG           Debug information; can be switched on or off with the Storage configuration directive '/config/debug'.

NOTIFICATION    Information that may be useful. No action is necessary.

INFORMATION     Information that is useful. This type of error should be noted, though no action is necessary.

WARNING         Returned when the command has been executed successfully, but some problem indications exist.

REJECTED        Returned when the input data are incorrect; the command has not been executed.

FAILED          Returned when a temporary system problem occurred; the command has not been executed.

ERROR           Returned when an internal system error occurred; the command has not been executed.

FATAL           Returned when Clusterpoint Server is not functioning.

 

The error severity indicates how the application should react:

·         If the error severity is FATAL or ERROR, inform the system administrator and stop working with the Clusterpoint Server.

·         If the error severity is FAILED, REJECTED, or WARNING, the errors are logged and can be analyzed, while work can continue.

When checking for errors in an automated script, always use the error code or error level as the reference point, not the error message text, as the text can vary, while the error code always stays the same.
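A script can apply these rules by reading the level and code elements of the reply. The sketch below is one possible approach, not the official client logic. It assumes that the cpse prefix is bound to the www.clusterpoint.com namespace in the reply and that the level is reported using the names listed above; the sample reply and its error code are invented.

```python
import xml.etree.ElementTree as ET

NS = {"cpse": "www.clusterpoint.com"}  # assumed binding for the cpse prefix
STOP_LEVELS = {"FATAL", "ERROR"}
LOG_LEVELS = {"FAILED", "REJECTED", "WARNING"}

# Hypothetical error reply; the code and text are invented.
SAMPLE = """<cpse:reply xmlns:cpse="www.clusterpoint.com">
  <cpse:error>
    <code> 2344 </code>
    <text> sample message </text>
    <level> REJECTED </level>
  </cpse:error>
</cpse:reply>"""


def classify_reply(xml_text):
    """Return (action, code); action is "ok", "log", "stop" or "ignore"."""
    root = ET.fromstring(xml_text)
    error = root.find("cpse:error", NS)
    if error is None:
        return "ok", None
    code = (error.findtext("code") or "").strip()
    level = (error.findtext("level") or "").strip().upper()
    if level in STOP_LEVELS:
        return "stop", code
    if level in LOG_LEVELS:
        return "log", code
    # DEBUG, NOTIFICATION, INFORMATION, or an unrecognized level.
    return "ignore", code
```

The function never looks at the text element, in line with the advice to rely on the code and level rather than on the message wording.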

The full list of possible error codes is available in ERROR MESSAGES.

When reporting errors to technical support, please provide full information: the entire request and reply, the Clusterpoint Server version, the operating system, and any other information that can help to resolve the problem.

Clusterpoint Server is a transaction-based system, which means that commands have a predefined timeout period. If a command is not executed within this timeout period, the command returns an error.

It is possible to define a timeout period for the request, or configure it for the Clusterpoint Server.

For more information on configuring timeout periods for the Clusterpoint Server, see the Clusterpoint Storage Configuration File.

For more detailed error references, descriptions of the most commonly encountered errors, error code groupings by problem area, and the complete list of all error codes, please see APPENDIX A.

 

 

5.   Clusterpoint Server Clustering

This section describes Clusterpoint Server clustering technology and provides general steps for working with it. Using the Clusterpoint Server clustering technology, several Clusterpoint Servers can be joined into one cluster, which makes it possible to search a text collection of unlimited size.

This section contains the following topics:

·         Principles

·         Creating Clusterpoint Server Cluster

5.1.                    Principles

A Clusterpoint Server cluster consists of nodes. Each node is a computer, on which the Clusterpoint Server is installed.

The Clusterpoint Server cluster technology has a transparent architecture, which means that each node is fully functional on its own. Access to all cluster nodes or to a single cluster node is simple and is described below.

5.2.                    Creating Clusterpoint Server Cluster

 

Clustering is a generic feature of Clusterpoint software. You can create clusters of multiple hardware servers for storing and searching very large data sets, from hundreds of gigabytes to terabytes. You benefit both from sharing workload and data among multiple machines and from resilience against failures if you keep multiple mirrored copies.

By default, the clustering API command set is transparent to the application developer.

To create a Clusterpoint Server cluster storage, proceed as follows:

1)      Assume that you have a very large text collection that is too big to be stored and processed on a single computer (treated here as node no. 1). Estimate into how many parts of equal size the text collection must be divided so that each part can be stored and processed on a single computer.

2)      Install Clusterpoint Server on each additional computer in your network, up to the number of computers estimated in the previous step (in this example, one more server, node no. 2).

3)      Using the Cluster Storages option in the Clusterpoint Manager management tool, create a clustered Clusterpoint storage on the selected hardware servers (nodes node1 and node2 in this example) that are visible as Clusterpoint Server hardware in your network.

4)      Create an application that imports each part of the text collection to its own node, addressing the node through its Clusterpoint gateway URL:

http://node1.domain.com/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc1

 

http://node2.domain.com/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc2
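The import application from step 4 needs some rule for deciding which node receives which document. The sketch below uses a simple round-robin split, which is an assumption of this example and not something the product prescribes. The node URLs follow the sample URLs above, document IDs are numbered per node as in those samples, and the function name insert_urls is hypothetical.

```python
from urllib.parse import urlencode

# Node gateway URLs, following the sample URLs above.
NODES = [
    "http://node1.domain.com/cgi-bin/cpse/cpse-gw.cgi",
    "http://node2.domain.com/cgi-bin/cpse/cpse-gw.cgi",
]


def insert_urls(titles, storage="test", nodes=NODES):
    """Assign documents to nodes round-robin and build insert URLs."""
    next_id = [1] * len(nodes)  # next free document ID on each node
    urls = []
    for n, title in enumerate(titles):
        node = n % len(nodes)
        params = [("command", "insert"), ("storage", storage),
                  ("id", next_id[node]), ("title", title)]
        urls.append(nodes[node] + "?" + urlencode(params))
        next_id[node] += 1
    return urls
```

Round-robin keeps the parts roughly equal in document count; a split by document size would be needed if the documents vary widely in length.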

 

 

 

When you have created a Clusterpoint Server cluster storage, you can either:

·         Connect to any of the Clusterpoint Server cluster’s nodes with your Clusterpoint API request as if it were a single-storage command. Transparently to your application, the node connects to all other hardware nodes in the cluster and produces a unified result of the search query over the whole text collection. The same is true for indexing API commands. If a mirrored cluster storage was created, all updates are indexed identically on all cluster hardware nodes. If a striped cluster storage was created, the hardware node with the least workload is selected for indexing each individual document (automatic load balancing determines where each document is placed in the clustered storage).

·         Connect to only one Clusterpoint Server cluster node and thus perform a search within only a part of the whole text collection, by specifying <cpse:type>single</cpse:type> in the Clusterpoint XML request message envelope and connecting directly to the gateway module IP address of that specific Clusterpoint Server. The API command is then performed only on that node's data set (1/Nth of the total database content). In this way you can selectively index or search part of the data, depending on your application logic and the data distribution needs of your business application.

 

This optional mechanism for addressing cluster nodes directly according to your own application logic, without a single point of failure, works well in modern private cloud computing environments.

For example, you can set up a pool of Clusterpoint database mirror servers (all running the same mirrored database) to be randomly queried by your application, implementing nearly linear load sharing.   Or you can address specific cluster nodes to store and search data, without initiating cluster-wide operations. 
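Random querying of a mirror pool can be sketched in a few lines. The mirror URLs below are placeholders, the function name mirror_order is invented, and trying the remaining mirrors as a fallback is an assumption of this example rather than documented behavior.

```python
import random

# Placeholder gateway URLs of servers that all hold the same mirrored database.
MIRRORS = [
    "http://node1.domain.com/cgi-bin/cpse/cpse-gw.cgi",
    "http://node2.domain.com/cgi-bin/cpse/cpse-gw.cgi",
]


def mirror_order(mirrors=MIRRORS, rng=random):
    """Return the mirrors in random order.

    The caller sends the query to the first entry and can move on to the
    next one if that server does not answer.
    """
    order = list(mirrors)  # copy, so the shared list is not reordered
    rng.shuffle(order)
    return order
```

Because every request draws a fresh random order, the load spreads across the pool without any central dispatcher.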

 

It is also a solution oriented toward high-performance networking: it avoids unnecessary data transfers through a strictly specified gateway, which would create a single point of failure for your application and a performance trade-off.

 

 

6.   Use Cases

This section contains sample applications or use cases written for the Clusterpoint Server system. Each use case is in a separate section and contains a short description and source code.

This section contains the following use cases:

·         Use Case in C: Importing Text Files

·         Use Case in Perl: Importing Text Files

·         Use Case in PHP: Searching Clusterpoint storage and Returning Results in HTML

·         Use Case in ASP: Searching Clusterpoint storage and Returning Results in HTML

·         Use Case in Java: Searching Clusterpoint storage from applet

6.1.                    Use Case in C: Importing Text Files

This application is implemented in the C programming language.

The application reads files from the file system and imports them to the Clusterpoint storage. The application receives file names as command line arguments.

It also detects whether a file is a text file or a binary file by counting the whitespace characters in it: a file with a relatively small share of whitespace is considered binary, and a file with a relatively large share is considered text.

/**

 * This application is implemented in the C programming language.

 *

 * The application reads files from the file system and imports them to the

 * Clusterpoint storage via HTTP POST interface using libcurl.

 *

 * The application receives file names as command line arguments.

 *

 * It also detects whether the file is a text file or a binary file by counting

 * whitespace characters in it: a file with a relatively small share of

 * whitespace is considered binary, and a file with a relatively large share

 * is considered text.

 */

 

/* include standard headers */

 

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <sys/types.h>

#include <sys/stat.h>

#include <unistd.h>

#include <ctype.h>

 

/* libcurl */

 

#include <curl/curl.h>

 

/* connection parameters */

 

char *url = "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi";

char *storage = "test";

char *user = "name";

char *passwd = "pass";

char *encoding = "US-ASCII";

char *post_fmt = "storage=%s&command=insert&user=%s&password=%s&id=%s&title=%s&rate=%d&text=%s&encoding=%s";

 

#define REQUIRED_WHITESPACE_FRACTION 0.12

 

typedef struct { int len, used; char *buf; } curl_reply;

 

 

 

size_t read_reply(void *buffer, size_t size, size_t nmemb, void *userp)

{

      int new_len;

 

      curl_reply *r = (curl_reply *) userp;

      for (new_len = r->len; new_len < r->used + size * nmemb + 1; new_len *= 2);

      if (new_len > r->len) r->buf = realloc(r->buf, new_len);

      memcpy(r->buf + r->used, buffer, size * nmemb);

      r->len = new_len;

      r->used += size * nmemb;

      r->buf[r->used] = '\0';

     

      return size * nmemb;

}

 

 

 

int main(int argc, char *argv[])

{

      CURL *curl_handle;

      char *storage_esc, *user_esc, *passwd_esc, *title_esc, *text_esc, *encoding_esc;

      curl_reply reply;

      char err_buf[CURL_ERROR_SIZE];

      int i;

     

      if (argc == 1) {

            printf("Usage: [-r url] [-s storage] [-u user] [-p password] [-e encoding] files\n");

            return 0;

      }

 

      /* read options */

     

      for (i = 1; i < argc; i++) {

            if (argv[i][0] == '-') {

                        if (i + 1 >= argc) break; /* no option value */

                        switch(argv[i][1]) {

                        case 'r':

                                    url = argv[i+1];

                                    break;

                        case 's':

                                    storage = argv[i+1];

                                    break;

                        case 'u':

                                    user = argv[i+1];

                                    break;

                        case 'p':

                                    passwd = argv[i+1];

                                    break;

                        case 'e':

                                    encoding = argv[i+1];

                                    break;

                        default:

                                    printf("Unknown option: %s\n", argv[i]);

                                    break;

                        }

                        i++;

            }

      }

     

      /* initialization */

 

      curl_global_init(CURL_GLOBAL_ALL);

      curl_handle = curl_easy_init();

      curl_easy_setopt(curl_handle, CURLOPT_URL, url);

      curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, read_reply);

      curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, &reply);

      curl_easy_setopt(curl_handle, CURLOPT_ERRORBUFFER, err_buf);

 

      storage_esc = curl_escape(storage, 0);

      user_esc = curl_escape(user, 0);

      passwd_esc = curl_escape(passwd, 0);

      encoding_esc = curl_escape(encoding, 0);

 

      /* names of files to be imported are passed as arguments

       * process each of them */

 

      for (i = 1; i < argc; i++) {

            FILE *f;

            struct stat st;

            int k, spaces;

            char *buf, *post_data;

 

            /* check if argument is option */

 

            if (argv[i][0] == '-') {

                        if (i + 1 >= argc) break; /* no option value */

                        i++;

                        continue;

            }

 

            printf("Reading file: '%s'\n", argv[i]);

           

            /* open file */

 

            f = fopen(argv[i], "r");

            if (!f) {

                        fprintf(stderr, "Couldn't open file '%s'\n", argv[i]);

                        continue;

            }

 

            /* retrieve file information */

 

            if (fstat(fileno(f), &st) < 0) {

                        fprintf(stderr, "Filesystem error retrieving info on '%s'\n", argv[i]);

                        fclose(f);

                        continue;

            }

            if (!S_ISREG(st.st_mode)) {

                        fprintf(stderr, "File '%s' is not regular file\n", argv[i]);

                        fclose(f);

                        continue;

            }

            printf("\tSize: %ld bytes\n", (long) st.st_size);

 

            /* read all of it into memory */

           

            /* note: this sample program assumes the whole file fits into memory,

             * so larger files need a different approach */

 

            buf = (char *) malloc(st.st_size + 1);

            if (!buf) {

                        fprintf(stderr, "Memory allocation failed\n");

                        fclose(f);

                        continue; /* skip this file instead of reading into a NULL buffer */

            }

            k = fread(buf, 1, st.st_size, f);

            fclose(f);

            if (k < st.st_size) {

                        fprintf(stderr, "Error reading file\n");

                        free(buf);

                        continue;

            }

            buf[k] = '\0';

 

            /* see if it is a text file;

             * estimate that by counting whitespace:

             * natural language text, in contrast to binary data,

             * must contain a significant portion of whitespace */

 

            spaces = 0;

            for (k = 0; k < st.st_size; k++) {

                        if (isspace((unsigned char) buf[k])) spaces++;

            }

            if (spaces < st.st_size * REQUIRED_WHITESPACE_FRACTION) {

                        printf("\tBinary file: ignored\n");

                        free(buf);

                        continue;

            }

 

            /* execute Clusterpoint Server insert command through HTTP POST interface */

 

            title_esc = curl_escape(argv[i], 0);

            text_esc = curl_escape(buf, k);

 

            post_data = malloc(strlen(storage_esc) + strlen(user_esc) + strlen(passwd_esc) + strlen(post_fmt) + 2 * strlen(title_esc) + 20 + strlen(text_esc) + strlen(encoding_esc));

            sprintf(post_data, post_fmt, storage_esc, user_esc, passwd_esc, title_esc, title_esc, 100, text_esc, encoding_esc);

            curl_easy_setopt(curl_handle, CURLOPT_POSTFIELDS, post_data);

 

            reply.buf = malloc(reply.len = 1);

            reply.buf[0] = '\0'; /* keep the buffer a valid string even if no data arrives */

            reply.used = 0;

            if (curl_easy_perform(curl_handle) != CURLE_OK) {

                        fprintf(stderr, "Error connecting to Clusterpoint Server: %s\n", err_buf);

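            } else if (strstr(reply.buf, "<cpse:error>")) { /* simplified error check */

                        char *msg = strstr(reply.buf, "<text>");

                        char *msg_end = msg ? strstr(msg, "</text>") : NULL;

                        if (msg_end) *msg_end = '\0';

                        fprintf(stderr, "Error returned from Clusterpoint Server: %s\n", msg ? msg + 6 : "(no error text)");

            } else {

                        char *docid = strstr(reply.buf, "<docid>");

                        char *docid_end = docid ? strstr(docid, "</docid>") : NULL;

                        if (docid_end) {

                                    *docid_end = '\0';

                                    printf("Document inserted with id %s\n", docid + 7);

                        } else {

                                    fprintf(stderr, "Unexpected reply from Clusterpoint Server\n");

                        }

            }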

 

            /* cleanup */

           

            free(buf);

            free(reply.buf);

            curl_free(title_esc); /* strings returned by curl_escape are released with curl_free */

            curl_free(text_esc);

            free(post_data);

      }

 

      /* final cleanup */

     

      curl_free(storage_esc); /* strings returned by curl_escape are released with curl_free */

      curl_free(user_esc);

      curl_free(passwd_esc);

      curl_free(encoding_esc);

     

      curl_easy_cleanup(curl_handle);

      curl_global_cleanup();

     

      return 0;

}

6.2.                    Use Case in Perl: Importing Text Files

This application is implemented in the Perl programming language.

The application reads files from the file system and imports them to the Clusterpoint storage. The application receives file names as command line arguments.

#

# This application is implemented in the Perl programming language.

#

# The application reads files from the file system and imports them to the

# Clusterpoint storage through HTTP POST interface using LWP::UserAgent.

#

# The application receives file names as command line arguments.

#

# It also detects whether the file is a text file or a binary file by counting

# whitespace characters in it: a file with a relatively small share of

# whitespace is considered binary, and a file with a relatively large share

# is considered text.

#

 

use HTTP::Request::Common;

use LWP::UserAgent;

use File::stat;

use Fcntl ':mode'; # provides the S_IFMT and S_IFREG constants used below

 

# connection parameters

$url = "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi";

$storage = "test";

$user = "name";

$passwd = "pass";

$encoding = "US-ASCII";

 

$REQUIRED_WHITESPACE_FRACTION = 0.12;

 

 

if (@ARGV == 0) {

      print "Usage: [-r url] [-s storage] [-u user] [-p password] [-e encoding] files\n";

      exit;

}

 

# read options

for ($i = 0; $i < @ARGV; $i++) {

      if (substr($ARGV[$i], 0, 1) eq '-') {

            if ($i + 1 >= @ARGV) { last; } # no option value

            $opt = substr($ARGV[$i], 1, 1);

            $val = $ARGV[$i + 1];

            if ($opt eq 'r') {

                        $url = $val;

            } elsif ($opt eq 's') {

                        $storage = $val;

            } elsif ($opt eq 'u') {

                        $user = $val;

            } elsif ($opt eq 'p') {

                        $passwd = $val;

            } elsif ($opt eq 'e') {

                        $encoding = $val;

            } else {

                        print "Unknown option: ", $ARGV[$i], "\n";

            }

            $i++;

      }

}

 

$ua = LWP::UserAgent->new;

 

# names of files to be imported are passed as arguments

# process each of them

for ($i = 0; $i < @ARGV; $i++) {

 

      # check if argument is option

      if (substr($ARGV[$i], 0, 1) eq '-') {

            if ($i + 1 >= @ARGV) { last; } # no option value

            $i++;

            next;

      }

 

      print "Reading file: '", $fn = $ARGV[$i], "'\n";

     

      # open file

      if (open(f, $fn)) {

     

            # retrieve file information

            if ($st = stat(*f)) {

                        if (($st->mode & S_IFMT) == S_IFREG) {

                                    print "\tSize: ", $st->size, " bytes\n";

                                   

                                    # read all of it into memory

                                    # note: this sample program assumes the whole file fits into memory,

                                    # so larger files need a different approach

                                    if (sysread(*f, $buf, $st->size) == $st->size) {

                                   

                                                # see if it is text file

                                                # estimate that by counting whitespace in it:

                                                # natural language text, in contrast to binary data,

                                                # must contain a significant portion of whitespace

                                                $nspaces = $buf =~ s/(\s)/$1/g;

                                                if ($nspaces >= $st->size * $REQUIRED_WHITESPACE_FRACTION) {

                                                           

                                                            # execute Clusterpoint Server insert command through HTTP POST interface

                                                            $response = $ua->request(POST $url, [

                                                                        storage => $storage,

                                                                        command => 'insert',

                                                                        user => $user,

                                                                        password => $passwd,

                                                                        id => $fn,

                                                                        title => $fn,

                                                                        rate => 100,

                                                                        text => $buf,

                                                                        encoding => $encoding

                                                            ]);

                                                           

                                                            if ($response->is_success && $response->content) {

                                                                        if ($response->content !~ /<cpse:error>/) { # simplified error check

                                                                                    $response->content =~ /<docid>([^<]*)<\/docid>/;

                                                                                    print "Document inserted: docid = $1\n";

                                                                        } else {

                                                                                    $response->content =~ /<code>([^<]*)<\/code>/;

                                                                                    print STDERR "Error returned from Clusterpoint Server: $1 - ";

                                                                                    $response->content =~ /<text>([^<]*)<\/text>/;

                                                                                    print STDERR "$1\n";

                                                                        }

                                                            } else {

                                                                        print STDERR "Error connecting to Clusterpoint Server: ", $response->code, ' - ', $response->message, "\n";

                                                            }

                                                } else {

                                                            print "\tBinary file: ignored\n";

                                                }

                                    } else {

                                                print STDERR "Error reading file\n";

                                    }

                        } else {

                                    print STDERR "File '$fn' is not a regular file\n";

                        }

            } else {

                        print STDERR "Filesystem error retrieving info on '$fn'\n";

            }

            close(f);

      } else {

            print STDERR "Could not open file '$fn'\n";

      }

}

6.3.                    Use Case in PHP: Searching Clusterpoint storage and Returning Results in HTML

This application is implemented in the PHP programming language.

The application searches the Clusterpoint storage and returns the results in HTML.

<?php

 

//

// This application is implemented in the PHP programming language.

//

// The application searches the Clusterpoint storage using HTTP API and returns the

// results in HTML.

//

 

$CPSE_SERVER = "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi";

$CPSE_STORAGE = "test";

$CPSE_USER = "name";

$CPSE_PASSWD = "pass";

 

//search query

$query = isset($_GET["q"]) ? $_GET["q"] : "";

 

//current position in results (forced to an integer because it is copied into URLs)

$curr_position = isset($_GET["p"]) ? intval($_GET["p"]) : 0;

 

//data encoding

$encoding = "UTF-8";

 

//send http header with correct encoding

send_headers($encoding);

 

//max results on page

$result_on_page = 10;

 

if ($curr_position < 0) {

      $curr_position = 0;

}

 

//max page from one site to show

$max_page_from_site = 2;

 

$xml_text = file_get_contents($CPSE_SERVER . "?storage=" . urlencode($CPSE_STORAGE) . "&command=search&user=" . urlencode($CPSE_USER) . "&password=" . urlencode($CPSE_PASSWD) . "&query=" . urlencode($query) . "&docs=$result_on_page&offset=$curr_position&from_site=$max_page_from_site&encoding=" . urlencode($encoding));

if ($xml_text == "") {

      die("Clusterpoint Server search error!");

}

 

//initialize xml to array object

$xml2a = new XMLToArray();

 

//parse xml

$root_node = $xml2a->parse($xml_text);

 

//pop root node from array

$cpse_reply = array_shift($root_node["_ELEMENTS"]);

 

//array for storing data from search results

//like total time spent, hits, and so on

$spec_data = array();

 

// examining Clusterpoint Server reply elements

foreach ($cpse_reply["_ELEMENTS"] as $cpse_reply_el) {

      if ($cpse_reply_el["_NAME"] == "seconds") {

            $spec_data[$cpse_reply_el["_NAME"]] = $cpse_reply_el["_DATA"];

      }

 

// examining Clusterpoint Server content elements folder

// (this loop is nested inside the foreach above, which is closed further below)

foreach ($cpse_reply_el["_ELEMENTS"] as $cpse_content) {

      $spec_data[$cpse_content["_NAME"]] = $cpse_content["_DATA"];

      $last_site = '';

      foreach($cpse_content["_ELEMENTS"] as $results) {

            $tit = "";

            $others = "";

            //parse each document tag from the result set

            foreach($results["_ELEMENTS"] as $documents) {

                        switch ($documents["_NAME"]) {

                        case "title" :

                                    $tit .= '<b>'.$documents["_DATA"].'</b>';

                                    break;

                        case "id" :

                                    $others .= '<br/><font size="-1"> ID: '.$documents["_DATA"].'</font>';

                                    break;

                        case "site" :

                                    if ($last_site == $documents["_DATA"])

                                                $blockquote = TRUE;

                                    else

                                                $blockquote = FALSE;

                                    $others .= '<br/><font size="-1"> Site: <i>'.$documents["_DATA"].'</i></font>';

                                    $last_site = $documents["_DATA"];

                                    break;

                        case "rate" :

                                    break;

                        case "info" :

                                    break;

                        case "text" :

                                    if (!empty($documents["_DATA"]))

                                                $others .= '<br/><font size="-1"> Snippet: <i>'.$documents["_DATA"].'</i></font>';

                                    break;

                        }

            }

           

                        //tab sites

                        if ($blockquote)

                                    $output .= '<blockquote>'.$tit.$others.'</blockquote>';

                        else

                                    $output .= '<br>'.$tit.$others.'<br>';

            }

      }

}

 

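if ($spec_data["hits"] == 0) {

      $output = "Your search <b>" . htmlspecialchars($query) . "</b> did not match any documents!";

      //nothing to list: keep the variables echoed below defined

      $from = $to = 0;

      $page_list = "";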

} else {

      //page listing

      $from = $curr_position + 1;

      $to = $curr_position + intval($spec_data["found"]);

      $list_begin_pos=0;

      $list_end_pos=$curr_position+($result_on_page*10);

      $page_list = "<font size=\"-1\">";

      $p = 1;

 

      if($curr_position > ($result_on_page * 10)){

            $list_begin_pos=$curr_position-($result_on_page*10);

            $p=intval($list_begin_pos/$result_on_page)+1;

      }

 

      if ($curr_position > 0) {

            $page_list .= " <a href=\"search.php?p=". ($curr_position - $result_on_page) ."&q=".urlencode(stripslashes($query))."\">&lt;&lt;Previous</a> &nbsp;&nbsp;";

      }

 

      $more_tag = $spec_data["more"];

      if ($more_tag[0] == '=') {

            $more = substr($more_tag,1);

      } else {

            $more = substr($more_tag,1) + 1;

      }

 

      for ($i = $list_begin_pos; $i - ($curr_position + $more) < $result_on_page && $i < $list_end_pos; $i+= $result_on_page) {

            if($i>=1000) break;

            if ($i != $curr_position) {

                        $page_list .= "<a href=\"search.php?p=$i&q=".urlencode(stripslashes($query)).(!empty($dir)?"&dir=$dir":"")."\">$p</a> ";

            } else {

                        $page_list .= "<b>$p </b>";

            }

            $p++;

      }

     

      if (($result_on_page+$curr_position)-($curr_position+$more) < 10 && $curr_position + $result_on_page < 1000) {

            $page_list .= "&nbsp;&nbsp; <a href=\"search.php?p=".($curr_position + $result_on_page)."&q=".urlencode(stripslashes($query))."\">Next&gt;&gt;</a>";

      }

     

      $page_list .= "</font>";

      //end of page listing

}

 

//echo output to client

echo '

<html>

<head>

<style><!--

body,td,p,a{font-family:arial,sans-serif;}

.servkat{color:#003399; text-decoration:none}

.homepage{color:#003399; text-decoration:none; font-size:10px;}

//-->

</style>

</head>

<body>

<table>

<tr bgcolor="#cccc66">

<td><font size="-1">&nbsp; Searched for: <b>'.htmlspecialchars($query).'</b> Results: <b>'.$from.'</b> - <b>'.$to.'</b> from <b>'.$spec_data["hits"].'</b> Search lasted <b>'.$spec_data["seconds"].'</b> seconds </font>&nbsp;</td>

</tr>

</table>

'.$output.'<br>

'.$page_list.'

</body>

</html>';

 

//#########################################################

 

class XMLToArray

{

      //----------------------------------------------------------------------

 

      // private variables

 

      var $parser;

      var $node_stack = array();

 

      //----------------------------------------------------------------------

 

      // PUBLIC

      // If a string is passed in, parse it right away.

 

      function XMLToArray($xmlstring="")

      {

            if ($xmlstring) return($this->parse($xmlstring));

            return(true);

      }

 

      //----------------------------------------------------------------------

 

      // PUBLIC

      // Parse a text string containing valid XML into a multidimensional array

      // located at root node.

 

      function parse($xmlstring="")

      {

            // set up a new XML parser to do all the work for us

            $this->parser = xml_parser_create();

            xml_set_object($this->parser, $this);

            xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);

            xml_set_element_handler($this->parser, "startElement", "endElement");

            xml_set_character_data_handler($this->parser, "characterData");

 

            // Build a Root node and initialize the node_stack

            $this->node_stack = array();

            $this->startElement(null, "root", array());

 

            // parse the data and free the parser...

            xml_parse($this->parser, $xmlstring);

            xml_parser_free($this->parser);

           

            // recover the root node from the node stack

            $rnode = array_pop($this->node_stack);

           

            // return the root node

            return($rnode);

      }

 

      //----------------------------------------------------------------------

 

      // PROTECTED

      // Start a new Element. This is done by pushing the new element onto the stack

      // and resetting its properties.

 

      function startElement($parser, $name, $attrs)

      {

            // create a new node

            $node = array();

            $node["_NAME"] = $name;

 

            foreach ($attrs as $key => $value) {

                        $node[$key] = $value;

            }

 

            $node["_DATA"] = "";

            $node["_ELEMENTS"] = array();

            // add the new node to the end of the node stack

            array_push($this->node_stack, $node);

      }

 

      //----------------------------------------------------------------------

 

      // PROTECTED

      // End an element. This is done by popping the last element from the

      // stack and adding it to the previous element on the stack.

 

      function endElement($parser, $name)

      {

            // pop this element off the node stack

            $node = array_pop($this->node_stack);

            $node["_DATA"] = trim($node["_DATA"]);

            // and add it as an element of the last node in the stack...

            $lastnode = count($this->node_stack);

            array_push($this->node_stack[$lastnode-1]["_ELEMENTS"], $node);

      }

     

      //----------------------------------------------------------------------

     

      // PROTECTED

      // Append the data to the character data already collected for the current node.

 

      function characterData($parser, $data)

      {

            // add this data to the last node in the stack...

            $lastnode = count($this->node_stack);

            $this->node_stack[$lastnode-1]["_DATA"] .= $data;

      }   

     

      //----------------------------------------------------------------------

}

 

//#########################################################

//## END OF CLASS

//#########################################################

 

//sends Content-type header to client browser

 

function send_headers($encoding)

{

      Header("Content-type: text/html;charset=$encoding");

}

 

?>
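
The search request sent by the PHP listing above is a plain HTTP GET whose parameters (storage, command, user, password, query, docs, offset, encoding) are URL-encoded. As a minimal sketch of the same request construction in Java, the following builds the URL only and does not send it. The class and method names are hypothetical and not part of any Clusterpoint API, and the from_site parameter is left out for brevity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the search URL that the PHP listing passes to file_get_contents().
final class SearchUrlSketch {

    // URL-encodes one value as UTF-8; UTF-8 is always available, so failure is treated as a bug.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Joins the parameters as name=value pairs separated by '&', in insertion order.
    static String buildQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(e.getKey()).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    // Parameter names are taken from the PHP listing; the gateway URL is supplied by the caller.
    static String buildSearchUrl(String gateway, String storage, String user, String password,
                                 String query, int docs, int offset) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("storage", storage);
        p.put("command", "search");
        p.put("user", user);
        p.put("password", password);
        p.put("query", query);
        p.put("docs", Integer.toString(docs));
        p.put("offset", Integer.toString(offset));
        p.put("encoding", "UTF-8");
        return gateway + "?" + buildQueryString(p);
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi",
                "test", "name", "pass", "tom & jerry", 10, 0));
    }
}
```

Encoding every value, not only the query text, keeps the request valid when a storage name or password contains characters such as & or a space.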

6.4.                    Use Case in ASP: Searching Clusterpoint storage and Returning Results in HTML

This application is implemented in ASP, using the VBScript programming language.

The application searches the Clusterpoint storage and returns the results in HTML.

<%

 

'

' This application is implemented in the VBScript programming language.

'

' The application searches the Clusterpoint storage using HTTP API and returns the

' results in HTML.

'

 

%>

<html>

<head>

<title>Search</title>

<style>

#results div.header { margin-bottom: 35px; }

#results div.result { padding-left: 15px; }

#results p.title { margin-bottom: 3px; }

#results p.snip { margin: 0px; }

#results p.id { margin-top: 3px; font-size: 0.9em; color: gray; }

#results p.error { color: red; }

#results .pagelist { padding-top: 20px; }

#results .pagelist p { display: inline; }

#results .pagelist ul { margin: 0px; padding: 0px; display: inline; }

#results .pagelist li {   display: inline; }

</style>

</head>

<body>

      <div id="results">

<%

 

nDocs = 10 'results per page

nPages = 10 'pages listed

 

Offset = Int(Request.QueryString("page")) * nDocs 'pages are numbered from 0, displayed from 1

sQuery = Request.QueryString("query")

 

Set Http = Server.CreateObject("MSXML2.ServerXMLHTTP")

Http.Open "POST", "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi", False

Http.Send "storage=test&command=search&docs=" & nDocs & "&offset=" & Offset & "&relevance=yes&query=" & Server.URLEncode(sQuery)

 

if Http.Status = 200 and not Http.ResponseXML is Nothing then

      Set Dom = Http.ResponseXML

      Dom.SetProperty "SelectionNamespaces", "xmlns:cpse='www.clusterpoint.com'"

      Set Content = Dom.SelectSingleNode("cpse:reply/cpse:content")

      if not Content is Nothing then

            'command executed ok

            n = Int(Content.SelectSingleNode("hits").Text)

            %><div class="header"><p>Found <b><% if n > 0 then Response.Write n else Response.Write "no"

            %></b> document<% if n <> 1 then Response.Write "s"

            if not Content.SelectSingleNode("real_query") is Nothing then sRealQuery = Content.SelectSingleNode("real_query").Text else sRealQuery = ""

            %> matching &quot;<%= Server.HTMLEncode(sRealQuery)

            %>&quot; (<b><%= Dom.SelectSingleNode("cpse:reply/cpse:seconds").Text %></b> seconds)</p></div><%= vbCrLf %><%

            if n > 0 then

                        'something has been found

                        for each Result in Content.SelectNodes("results/document")

                                    %><div class="result"><%

                                    sTitle = Result.SelectSingleNode("title").Text

                                    %><p class="title"><a href="<%= Server.HTMLEncode(Result.SelectSingleNode("id").Text) %>"><%= Server.HTMLEncode(sTitle) %></a></p><%

                                    %><p class="snip"><%= Replace(Result.SelectSingleNode("text").Text, "#", " ") %></p><%

                                    %><p class="id"><%= Server.HTMLEncode(Result.SelectSingleNode("id").Text) %></p><%

                                    %></div><%= vbCrLf %><%

                        next

                        'page listing

                        iFrom = Int(Content.SelectSingleNode("from").Text)

                        nMore = Int(Mid(Content.SelectSingleNode("more").Text, 2))

                        nSure = Int((nMore + iFrom + 2 * nDocs - 1) / nDocs)

                        if iFrom > 0 or nMore > 0 then

                                    %><div class="pagelist"><%= vbCrLf %><p>Result pages:<%= vbCrLf %><%

                                    iPage = Int(iFrom / nDocs)

                                    i = Int((iPage - 1) / (nPages - 2)) * (nPages - 2)

                                    if i < 0 then i = 0

                                    %><ul><%= vbCrLf %><%

                                    sLink = "<a href=""search.asp?query=" & Server.URLEncode(sQuery) & "&page="

                                    if iPage > 0 then

                                                %><li><%= sLink & (iPage - 1) %>">&lt;&lt;&lt; Previous</a></li><%= vbCrLf %><%

                                    end if

                                    j = i

                                    do while j < i + nPages and j < nSure

                                                %><li><%

                                                if j = iPage then

                                                            %><b><%= j + 1 %></b><%

                                                else

                                                            %><%= sLink & j %>"><%= j + 1 %></a><%

                                                end if

                                                %></li><%= vbCrLf %><%

                                                j = j + 1

                                    loop

                                    if nMore > 0 then

                                                %><li><%= sLink & (iPage + 1) %>">Next &gt;&gt;&gt;</a></li><%= vbCrLf %><%

                                    end if

                                    %></ul><%= vbCrLf %></p><%= vbCrLf %></div><%= vbCrLf %><%

                        end if

            end if

      else

            'error

            Set Content = Dom.SelectSingleNode("cpse:reply/cpse:error")

            %><p class="error">Error <%= Content.SelectSingleNode("code").Text %>: <%= Server.HTMLEncode(Content.SelectSingleNode("text").Text) %></p><%= vbCrLf %><%

      end if

else

      %><p class="error">Search failed!</p><%= vbCrLf %><%

end if

 

%> </div>

</body>

</html>
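
Both the PHP and the ASP listings read the "more" element of the search reply by skipping its first character and parsing the rest as a number; the PHP listing additionally treats a leading '=' as an exact count and adds one for any other prefix. The following Java sketch mirrors that logic under the same assumption about the value's shape. The class name, the method names, and the '>' prefix used in the example are hypothetical; this section does not document which prefix characters the server actually sends.

```java
// Hypothetical helper mirroring how the PHP listing interprets the "more" element.
final class MoreTagSketch {

    // Parses values such as "=15": the first character is a prefix, the rest is a number.
    // A '=' prefix is taken as an exact count; any other prefix adds one, as in the PHP listing.
    static int parseMore(String moreTag) {
        if (moreTag == null || moreTag.length() < 2) {
            return 0; // nothing usable in the reply: assume no further results
        }
        int n = Integer.parseInt(moreTag.substring(1).trim());
        return moreTag.charAt(0) == '=' ? n : n + 1;
    }

    // A "Next" link is only worth showing when further results are expected.
    static boolean hasNextPage(String moreTag) {
        return parseMore(moreTag) > 0;
    }

    public static void main(String[] args) {
        System.out.println(parseMore("=15"));
        System.out.println(hasNextPage("=0"));
    }
}
```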

6.5.                    Use Case in Java: Searching Clusterpoint storage from applet

This application is implemented as a Java applet.

The application searches the Clusterpoint storage and displays results.

6.5.1.  ClusterpointServerJApi.java

import java.awt.*;

import javax.swing.*;

import java.awt.event.*;

 

public class ClusterpointServerJApi extends JApplet implements ActionListener {

    private JPanel contentPane;

    private JPanel Buttons = new JPanel();                

    private JPanel Results = new JPanel();

    private JPanel Properties = new JPanel();

    private JPanel MainPan = new JPanel();

    private JLabel HostLabel = new JLabel("Host:");

    private JLabel StorageLabel = new JLabel("Storage:");

    private JLabel QueryLabel = new JLabel("Query:");

    private JButton SearchButt = new JButton("Search");

    private JButton ClearButt = new JButton("Clear");

    private JTextField QueryField = new JTextField(10);

    private JTextField HostField = new JTextField("http://",20);

    private JTextField StorageField = new JTextField(10);    

    private JTextArea ResultArea = new JTextArea();

   

    public void init() { 

        //ContentPane Layout

        contentPane = (JPanel) this.getContentPane();

        contentPane.setLayout(new BorderLayout());                                       

       

        //Main pane

        MainPan.setLayout(new GridBagLayout());

        //Properties pane

        Properties.setLayout(new GridBagLayout());               

        Properties.setBorder(BorderFactory.createTitledBorder("Properties"));

        //Buttons pane

        Buttons.setLayout(new GridLayout(1,2,5,0));       

        //Results pane

        GridLayout gridLayout1 = new GridLayout();

        gridLayout1.setVgap(0);

        gridLayout1.setHgap(0);

        gridLayout1.setColumns(1);

        gridLayout1.setRows(10);

        Results.setLayout(new BorderLayout());                   

        Results.setBorder(BorderFactory.createTitledBorder("Results"));

        //Add Main pane to contentPane

        contentPane.add(MainPan,BorderLayout.NORTH);

       

        //Properties pan to Main pane

        MainPan.add(Properties,new GridBagConstraints(0, 0, 1, 1, 1.0, 1.0

            ,GridBagConstraints.NORTH, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));       

        //Add controls to Properties Pane

        Properties.add(HostLabel,new GridBagConstraints(0, 1, 1, 1, 1.0, 1.0

            ,GridBagConstraints.WEST, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 1, 0));

        Properties.add(HostField,new GridBagConstraints(1, 1, 1, 1, 1.0, 1.0

            ,GridBagConstraints.WEST, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));

        Properties.add(StorageLabel, new GridBagConstraints(2, 1, 1, 1, 1.0, 1.0

            ,GridBagConstraints.WEST, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 1, 0));

        Properties.add(StorageField, new GridBagConstraints(3, 1, 1, 1, 1.0, 1.0

            ,GridBagConstraints.WEST, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));

        Properties.add(QueryLabel, new GridBagConstraints(0, 2, 1, 1, 1.0, 1.0

            ,GridBagConstraints.WEST, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 0, 0));

        Properties.add(QueryField, new GridBagConstraints(1, 2, 3, 1, 1.0, 1.0

            ,GridBagConstraints.WEST, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));

        //Add Buttons to Main Pan

        MainPan.add(Buttons,new GridBagConstraints(0, 1, 1, 1, 1.0, 1.0

            ,GridBagConstraints.NORTH, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 20, 0));

        Buttons.add(SearchButt,null);

        Buttons.add(ClearButt,null);

       

        MainPan.add(Results,new GridBagConstraints(0, 2, 1, 1, 1.0, 1.0

            ,GridBagConstraints.NORTH, GridBagConstraints.BOTH, new Insets(1, 1, 0, 1), 0,0));

       

        Results.add(ResultArea);                                     

        SearchButt.addActionListener(this);        

        ClearButt.addActionListener(this);                       

    }       

 

    //Action listener search and clear buttons

    public void actionPerformed(ActionEvent e) {

        if (e.getSource() == ClearButt) {

            QueryField.setText("");

        } else if (e.getSource() == SearchButt)  {           

            String cgiUrl = HostField.getText();

            ClusterpointServerExch cpseReq;

 

            //Create Clusterpoint XML query

            ClusterpointServerMess cpseXMLQuery = new ClusterpointServerMess("search",StorageField.getText(),"name","pass",QueryField.getText());

 

            //Do data exchange with Clusterpoint Server

            cpseReq = new ClusterpointServerExch(cgiUrl,cpseXMLQuery.getMess());

 

            cpseReq.doQuery();

 

            String temp = cpseReq.getResponse();

 

            //Parse out XML results

            ClusterpointServerXMLParser cpseXMLResp = new ClusterpointServerXMLParser(temp.trim());

 

            String[][] resArray = cpseXMLResp.parse();

            String outp = "";

 

            //Format output; parse() returns null when the reply could not be parsed

            for (int i = 0; resArray != null && i < cpseXMLResp.getResultLength(); i++) {

                System.out.println("URL["+i+"]: "+resArray[i][1]+" Title: "+resArray[i][0]);

 

                outp += "Title: "+resArray[i][0]+"\n";

                outp += "ID: "+resArray[i][1]+"\n\n";

            }

 

            ResultArea.setText(outp);

                       

        }

    }

 

}

6.5.2.  ClusterpointServerMess.java

import java.util.Calendar;

 

public class ClusterpointServerMess {

    private String iComm;

    private String iData;

    private String iUser;

    private String iPass;

    private String iStorage;

    private String message;

 

    /** Creates new ClusterpointServerMess */

    public ClusterpointServerMess(String command,String storage,String user, String passwd) {

        iComm = command;

        iData = null;

        iUser = user;

        iPass = passwd;

        iStorage = storage;

 

        message = ComposeMess();

 

    }

 

    public ClusterpointServerMess(String command,String storage, String user, String passwd,String data) {

        iComm = command;

        iData = data;       

        iUser = user;

        iPass = passwd;

        iStorage = storage;

        message = ComposeMess();

    }

   

    public String getMess() {

        return message;

    }

   

    private String ComposeMess() {

        String mess = ""; 

        long current = System.currentTimeMillis(); 

        mess =  "<cpse:request xmlns:cpse=\"www.clusterpoint.com\">";

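        // Calendar.YEAR, Calendar.MONTH and so on are field selectors, not values: read them from a Calendar instance

        Calendar now = Calendar.getInstance();

        mess += "<cpse:timestamp>"+now.get(Calendar.YEAR)+"/"+(now.get(Calendar.MONTH)+1)+"/"+now.get(Calendar.DAY_OF_MONTH)+" "+now.get(Calendar.HOUR_OF_DAY)+":"+now.get(Calendar.MINUTE)+":"+now.get(Calendar.SECOND)+"</cpse:timestamp>";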

        mess += "<cpse:command>"+iComm+"</cpse:command>";

        mess += "<cpse:requestid>"+current+"</cpse:requestid>";

        mess += "<cpse:storage>"+iStorage+"</cpse:storage>";

        mess += "<cpse:reply_charset>UTF-8</cpse:reply_charset>";

        mess += "<cpse:user>"+iUser+"</cpse:user>";

        mess += "<cpse:password>"+iPass+"</cpse:password>";

        mess += "<cpse:application>JavaApi</cpse:application>";

        mess += "<cpse:content>";

        if ("search".equals(iComm)) { // compare string contents, not object references

            mess += "<query>"+iData+"</query>";

            mess += "<docs>10</docs>";

        }

        mess += "</cpse:content>";

        mess += "</cpse:request>";

        return mess;

    }   

 

}
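
ComposeMess above concatenates the query text into the query element unchanged, so a query that contains characters such as & or < would produce a request that is not well-formed XML. A minimal sketch of an escaping helper that could be applied to the data before concatenation follows. The class and method names are hypothetical and are not part of the original sample.

```java
// Hypothetical helper: escapes the five XML special characters in text content.
final class XmlEscapeSketch {

    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Builds the query element the way ComposeMess does, but with escaped data.
        String query = "price < 10 & title";
        System.out.println("<query>" + escapeXml(query) + "</query>");
    }
}
```

The same helper would also suit the user, password, and storage values, since they are inserted into the request in the same way.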

6.5.3.  ClusterpointServerExch.java

import java.io.*;

import java.net.*;

import javax.swing.*;

 

public class ClusterpointServerExch {

    private String iHost;

    private String iData;

    private String query;

    private String response;

    private String iFname;

 

    /** Creates new ClusterpointServerExch */

    public ClusterpointServerExch(String Host, String data)  {

        iData = data;

        try {

            URL aURL = new URL(Host);

            iHost = aURL.getHost();

            iFname = aURL.getFile();

        } catch (MalformedURLException e) {

            // fail early with a clear message instead of a NullPointerException on the unparsed URL

            throw new IllegalArgumentException("Malformed URL: " + Host);

        }

    }

   

    public int doQuery() {

 

        try {

            Socket socket = new Socket(iHost,80);

            // write the request as UTF-8 so that the body matches the Content-Length computed below

            BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "UTF-8"));

            socket.setSoTimeout(60000); // set 1 minute timeout

            String header =  "POST "+iFname+" HTTP/1.0\r\n";

                   header += "Host: "+iHost+"\r\n";

                   header += "User-Agent: Clusterpoint Server Client Sample\r\n";

                   header += "Content-Length: " + iData.getBytes("UTF-8").length+"\r\n\r\n";

 

            wr.write(header);

            wr.write(iData);

 

            wr.flush();

 

            response = read_socket(socket);

 

            wr.close();

            socket.close();

 

        } catch (UnknownHostException e) {

            System.err.println("Exception: Unknown host " + iHost + "!");

            System.exit(1);

        } catch (IOException e) {

            System.err.println("Exception: I/O error during connection!");

            System.exit(1);

        }

 

        return 1;

 

    }

 

   

    public String getResponse() {

        return response;

    }

   

    private String read_socket(Socket fsocket){

      String reply="";

      try{

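            // the request asks for a UTF-8 reply (cpse:reply_charset), so decode the stream as UTF-8

            BufferedReader rd= new BufferedReader(new InputStreamReader(fsocket.getInputStream(), "UTF-8"));

 

            StringBuffer tempresp = new StringBuffer();

 

            int ch=0;

 

            while (true){

                ch = rd.read();

                if (ch < 0)

                    break;

                else

                    tempresp.append((char)ch);

            }

 

            reply = new String(tempresp);

 

            // strip the HTTP headers; leave the reply unchanged if no header separator is found

            int bodyStart = reply.indexOf("\r\n\r\n");

            if (bodyStart >= 0)

                reply = reply.substring(bodyStart + 4);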

                       

      } catch (InterruptedIOException e){

            return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";

      } catch (UnknownHostException e){

            return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";

      } catch (IOException e){

            return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";

      } catch (NullPointerException e){

            return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";

      }

    return reply;

}

 

}

6.5.4.  ClusterpointServerXMLParser.java

import java.io.*;

import javax.xml.parsers.*;

import org.w3c.dom.*;

import org.xml.sax.*;

 

public class ClusterpointServerXMLParser {

    private String iData;

    private String[][] Resultset = new String[10][2];

    private int ResCount = 0;

 

    /** Creates new ClusterpointServerXMLParser */

    public ClusterpointServerXMLParser(String data) {

        iData = data;       

    }

   

    public String[][] parse() {

        try {

            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

            Document doc = builder.parse(new InputSource(new StringReader(iData)));

            

            NodeList nodes = doc.getElementsByTagName("document");

           

            ResCount = nodes.getLength();

           

            for (int i = 0; i < nodes.getLength(); i++) {

                Element element = (Element) nodes.item(i);

               

                NodeList title = element.getElementsByTagName("title");

                Element line = (Element) title.item(0);              

               

                Resultset[i][0] = new String(getCharacterDataFromElement(line));

 

                NodeList url = element.getElementsByTagName("id");

                line = (Element) url.item(0);

               

                Resultset[i][1] = new String(getCharacterDataFromElement(line));

            }

           

        } catch (Exception e) {           

            e.printStackTrace();

            return null;

        }

        return Resultset;

    }

    public int getResultLength() {

        return ResCount;

    }

    public String getCharacterDataFromElement(Element e) {

        Node child = e.getFirstChild();

        if (child instanceof CharacterData) {

            CharacterData cd = (CharacterData) child;

            return cd.getData();

        }

        return "?";

    }

}

 

 

 

Appendix A: Error Messages

Error Handling

Error handling in the Clusterpoint Server system is simple: an XML 'error' tag is added to the reply for every error encountered.

Each error is returned in the Clusterpoint XML reply and contains the error code, the error text, the error severity, and the subsystem in which the error was generated.

If a command sent to the Clusterpoint Server is not executed successfully, an error is returned in the following XML reply message:

<?xml version="1.0" encoding="REPLY-ENCODING"?>

<cpse:reply xmlns:cpse="www.clusterpoint.com">

  <cpse:storage> Storage name </cpse:storage>

  <cpse:timestamp> reply creation date and time </cpse:timestamp>

  <cpse:command> command name for which the reply is created </cpse:command>

  <cpse:requestid> message request id for which the reply is created </cpse:requestid>

  <cpse:content> command specific data </cpse:content>

  <cpse:seconds> time spent for the reply creation </cpse:seconds>

  <cpse:error>

    <code> error code (4 digits) </code>

    <text> error textual message </text>

    <level> error severity </level>

    <source> subsystem in which the error occurred </source>

    <document_id> document_id that the error refers to - for some errors </document_id>

  </cpse:error>

</cpse:reply>

The error severity can be one of the following:

Title | Description
Warning | Returned when the command is executed successfully, but there are indications of a problem.
Failed | Returned when the input data is incorrect.
Error | Returned when an error occurs during command execution.
Fatal | Returned when the system is not functioning.

The error severity tells the calling system how to react:

·         If the error severity is Fatal or Error, the system work must be interrupted and the system administrator must be informed.

·         If the error severity is Failed or Warning, the errors can be logged and analyzed, while the system work can continue.
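The rule above can be sketched in Java as follows. This is a minimal illustration, not part of the Clusterpoint client library: the class and method names are hypothetical, only the first error element of a reply is read, and treating any severity other than Fatal or Error (including ones not listed above) as non-interrupting is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the error element of an XML reply.
final class ReplyError {

    /** Returns the text of one child (code, text, level, source) of the first error element, or null. */
    static String field(String replyXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(replyXml)));
            NodeList top = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < top.getLength(); i++) {
                Node n = top.item(i);
                String tag = n.getNodeName();
                // Accept the error element whatever namespace prefix it carries.
                if (n instanceof Element && (tag.equals("error") || tag.endsWith(":error"))) {
                    NodeList hits = ((Element) n).getElementsByTagName(name);
                    return hits.getLength() > 0 ? hits.item(0).getTextContent().trim() : null;
                }
            }
            return null; // the reply carries no error element
        } catch (Exception e) {
            throw new IllegalArgumentException("Reply is not parseable XML", e);
        }
    }

    /** True when work must be interrupted (Fatal or Error); other levels are only logged. */
    static boolean mustInterrupt(String level) {
        return "fatal".equalsIgnoreCase(level) || "error".equalsIgnoreCase(level);
    }
}
```

An application would typically call ReplyError.field(reply, "level") after each command and stop processing when mustInterrupt returns true.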

Clusterpoint Server is a transaction-based system, which means that every command has a predefined timeout period. If a command does not complete within this period, it returns an error.

The timeout period can be set for an individual request or configured for the whole Clusterpoint Server.

For more information on configuring timeout periods for the Clusterpoint Server, see the Clusterpoint Server User Guide.

 

 

The following table lists the most common error codes and messages. Actual message texts can differ from those listed, and the list may be incomplete because new software versions are released from time to time. Please contact Clusterpoint for an updated list of Clusterpoint Server error messages.

If you receive an error code that is not in this table and its meaning is not clear from the message, please contact technical support.

Code | Level | Error message | Possible causes | Suggested solutions
1212 | Fatal | Out of memory | Server is out of memory. | Stop unused storages and other processes. This will most likely cause data corruption.
1419 | Error | Connection to node has failed. | Cluster node (or storage) is down. | Start stopped storages, check network connectivity.
1517 | Fatal | I/O error. Unable to write to disk. | There are problems with the disk or the disk is full. | If the disk is full, delete unnecessary data. If the disk is not full, check for hardware errors.
1518 | Fatal | I/O error. Unable to read from disk | There are problems with the disk. | Check for hardware errors.
1520 | Error | Cannot write to temporary directory. | The Clusterpoint Server process cannot write to the temporary directory. | Check the storage configuration where the temporary directory is defined. Check permissions.
1612 | Error | Invalid XML. | Clusterpoint Server received invalid XML in inter-process communication. | Contact technical support and report the bug.
1619 | Rejected | Invalid XML | The request was invalid XML. | Fix the request and try again.
1837 | Rejected | Invalid user name and/or password or permission denied. | An invalid user name and/or password was passed with the request, or the requested operation is not allowed. | Correct the user name and/or password or contact your administrator.
1839 | Fatal | Invalid license - invalid signature. | The license file is corrupted or modified. | Obtain a valid license.
1842 | Error | License has expired. | The time-limited license has expired. | Obtain a new license to continue using Clusterpoint Server.
1844 | Fatal | Invalid license - wrong server ID. | The installed license file does not match the server. | Install the correct license for the server.
1845 | Fatal | License not found. | No license is installed. | Obtain and install a license.
2231 | Error | Failed to exchange data with gateway. | The Clusterpoint Server process is not available. | Start Clusterpoint Server.
2344 | Rejected | Storage not available. | The request was sent to a storage that is not active or does not exist. | Start the stopped storage or fix the request.
2417 | Rejected | Unknown command. | The request contains an invalid command name. | Fix the request and try again.
2420 | Warning | Old namespace, please upgrade client library. | You are using a library that uses the old (1.x) request namespace. | Upgrade (or fix) the library.
2421 | Rejected | Single command not found. | You are trying to execute a cluster command on a single node. | Check and fix your request.
2422 | Rejected | Cluster command not found. | You are trying to execute a single command on a cluster node. | Check and fix your request.
2425 | Rejected | Missing required parameter. | Your request is missing some mandatory parameters. | Check the documentation and fix your request.
2428 | Failed | Node not available. | Clusterpoint Server tried to execute a command on a node that is not available. | Contact technical support.
2430 | Error | Received zero length message. | Inter-process communication received an empty message. | This could mean server overload and a timeout in request processing. If the server is not overloaded, please contact technical support.
2437 | Rejected | Document root tag is not present. | Your data modification request contains a document root tag that is wrong according to the policy. | Check the policy and change the request.
2445 | Rejected | Auto-increment feature is not available. | You are sending documents with auto-increment IDs, but the auto-increment feature is not available. | Restore the auto-increment feature using the command reset-autoincrement-status.
2448 | Failed | Stripe not available, modifications denied. | At least one stripe node is not available. To preserve cluster integrity, modification requests are denied. | Restore the failed node or remove it from the configuration.
2626 | Rejected | Duplicate primary key. | Your insert request contains documents with IDs that already exist in the storage. | Change the document IDs, use an update/replace request or delete the existing document first.
2824 | Rejected | Requested document does not exist. | You are trying to retrieve a document that does not exist in the storage. | Change the request document ID to an existing one.
2826 | Rejected | Synchronization source not found. | You are trying to perform synchronization from an invalid source. | Check and fix the request 'from' parameter.
2830 | Failed | Long modification process is running. | The storage is performing a long modification process, for example reindex or restore. | Try your request later.

 

The list of errors may differ slightly between Clusterpoint Server versions.

 

There are several groups of errors that are returned to the developer application and written to log files. Each error group has a range of 4-digit codes. The groups of errors in the overall Clusterpoint Server system are listed in Table 1:

TABLE 1.  Groups of errors and their code range in Clusterpoint Server software

Code range | Errors group | Description of errors group
1xxx | general errors | general errors which can be returned by any subsystem
2xxx | database engine errors | core database engine errors
3xxx | apps errors | Clusterpoint application subsystem errors
4xxx | tools errors | Clusterpoint tools errors (pre-packaged applications)

 

From the application developer's point of view, the most interesting errors are 'general errors' and 'database engine errors'. These errors are generated in the Clusterpoint Server core engine subsystems or the Web server gateway module as responses to user application XML query messages. The application developer can process these errors according to the business logic of the application.

Errors are also grouped in more detail according to the problem area.

If you encounter an error code that is not present in the complete error list of this guide, use Table 2 to check which subgroup (and related problem area) the unlisted error belongs to and what might be wrong.

 

TABLE 2.  Error code ranges for specific problem areas

Code range | Types of error | Description
11xx | general problems | general problems
12xx | memory problems | memory problems
13xx | process problems | process problems
14xx | network problems | network problems
15xx | storage (disk) problems | storage (disk) problems
16xx | xml parser problems | xml parser problems
17xx | lock problems | lock problems
18xx | users authorization problems | users authorization problems
19xx | common | common errors
21xx | general problems | general problems
22xx | cpse | Core database server errors
23xx | cps2-storage | Storage demon errors
24xx | cps2-filter | Filtered triggers subsystem errors
25xx | cpse-gw | Clusterpoint Server Web server gateway module errors
26xx | cps2-master | Clusterpoint Server master demon errors
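The lookup described by Table 1 and Table 2 can be sketched in Java as follows. This is an illustrative helper, not part of any Clusterpoint library: the class and method names are hypothetical, and the strings are copied from the two tables. Codes outside the listed ranges return "unknown".

```java
// Hypothetical helper: classifies a 4-digit error code by the ranges in Table 1 and Table 2.
final class ErrorCodeRanges {

    /** Maps the first digit of the code to its error group (Table 1). */
    static String group(int code) {
        switch (code / 1000) {
            case 1: return "general errors";
            case 2: return "database engine errors";
            case 3: return "apps errors";
            case 4: return "tools errors";
            default: return "unknown";
        }
    }

    /** Maps the first two digits of the code to its problem area (Table 2). */
    static String area(int code) {
        switch (code / 100) {
            case 11: return "general problems";
            case 12: return "memory problems";
            case 13: return "process problems";
            case 14: return "network problems";
            case 15: return "storage (disk) problems";
            case 16: return "xml parser problems";
            case 17: return "lock problems";
            case 18: return "users authorization problems";
            case 19: return "common";
            case 21: return "general problems";
            case 22: return "cpse";
            case 23: return "cps2-storage";
            case 24: return "cps2-filter";
            case 25: return "cpse-gw";
            case 26: return "cps2-master";
            default: return "unknown";
        }
    }
}
```

For an unlisted code, logging both group(code) and area(code) next to the raw message gives technical support the subsystem hint straight away.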

 

The complete list of all system errors and their codes is shown in the reference table below (Table 3).

 

TABLE 3.  List of all software system errors in Clusterpoint Server

Code | Error name
1111 | ERROR_DEBUG_NOTIFICATION
1112 | ERROR_BUS_ERROR
1113 | ERROR_SYSTEM_BUSY
1114 | ERROR_SYSTEM_DOWN
1115 | ERROR_INVALID_RESPONSE
1116 | ERROR_INVALID_TAG
1211 | ERROR_MEMORY_CORRUPTION
1211 | ERROR_SEGMENTATION_FAULT
1212 | ERROR_OUT_OF_MEMORY
1213 | ERROR_MEMORY_MAPPING
1214 | ERROR_MEMORY_ALLOCATION
1311 | ERROR_CREATE_THREAD
1411 | ERROR_DAEMON_START
1412 | ERROR_DAEMON_RECEIVE
1413 | ERROR_DAEMON_SEND
1414 | ERROR_DAEMON_EXCHANGE
1415 | ERROR_DAEMON_SOCKET
1511 | ERROR_DISK_IO
1512 | ERROR_PERMISSION
1611 | ERROR_XML_PARSING
1612 | ERROR_BAD_XML
1613 | ERROR_XML_MANIPULATION
1614 | ERROR_XML_DUMP
1615 | ERROR_XML_CORRUPTED
1711 | ERROR_MODIFY_LOCK_TIMEOUT
1712 | ERROR_ACCESS_LOCK_FULL
1713 | ERROR_ACCESS_LOCK_TIMEOUT
1714 | ERROR_MODIFY_RELEASE
1715 | ERROR_ACCESS_RELEASE
1811 | ERROR_AUTH_MODULES
1812 | ERROR_AUTH_HANDLER
1813 | ERROR_AUTH_SKIP
1814 | ERROR_AUTH_ENUMERATESTOP
1815 | ERROR_AUTH_NOTSUPPORTED
1816 | ERROR_AUTH_PARAMETERS
1817 | ERROR_AUTH_FAILED
1818 | ERROR_AUTH_USEREXISTS
1819 | ERROR_AUTH_NOUSER
1820 | ERROR_AUTH_GROUPEXISTS
1821 | ERROR_AUTH_NOGROUP
1822 | ERROR_AUTH_INTERNAL
1823 | ERROR_AUTH_CONFLICT
1824 | ERROR_AUTH_CIRCULAR
1825 | ERROR_AUTH_UNAVAILABLE
1826 | ERROR_AUTH_DENIED
1827 | ERROR_AUTH_DUPLICATE
1911 | ERROR_QUEUE_DUMP
1912 | ERROR_QUEUE_PUSH
1913 | ERROR_TABLE_NOTIFICATION
1914 | ERROR_TABLE_STRUCTURE
1915 | ERROR_TABLE_READ
1916 | ERROR_TABLE_CLOSE
1917 | ERROR_TABLE_INTEGRITY
1918 | ERROR_TABLE_CONFIRM
1919 | ERROR_TABLE_ADVANCE
1920 | ERROR_TABLE_PUT
1921 | ERROR_TABLE_WRITE
1922 | ERROR_TABLE_SYNCHRONIZE
1923 | ERROR_TABLE_OPEN
1924 | ERROR_TABLE_TRANSFER
2111 | ERROR_STORAGE_UNAVAILABLE
2112 | ERROR_STORAGE_NOT_FOUND
2113 | ERROR_MISSING_REQUESTID
2114 | ERROR_WRONG_STORAGE
2115 | ERROR_AUTHORIZATION_USER
2116 | ERROR_AUTHORIZATION_IP
2117 | ERROR_AUTHORIZATION_PASS
2118 | ERROR_AUTHORIZATION_UNKNOWN
2119 | ERROR_INVALID_COMMAND
2211 | ERROR_CPSE_CONFIGURATION
2311 | ERROR_POSSIBLE_CORRUPTION
2312 | ERROR_DATA_CORRUPTION
2313 | ERROR_DUPLICATE_KEY
2314 | ERROR_KEY_NOT_FOUND
2315 | ERROR_DOCUMENT_UNAVAILABLE
2411 | ERROR_FILTER_COMMAND
2412 | ERROR_FILTER_CONFIGURATION
2413 | ERROR_FILTER_STRUCTURE
2414 | ERROR_FILTER_MESSAGE
2415 | ERROR_FILTER_ADD
2416 | ERROR_FILTER_GET
2511 | ERROR_GW_PRECONDITION
2512 | ERROR_GW_PARAMETER
2513 | ERROR_GW_TEMPLATE
2514 | ERROR_GW_REQUEST
2515 | ERROR_GW_XSLT
2516 | ERROR_GW_CONFIGURATION
2517 | ERROR_GW_CONVERT
2518 | ERROR_GW_STORAGE
2519 | ERROR_GW_COMMAND
2520 | ERROR_GW_DISPATCH
2611 | ERROR_IDX_OVERLOAD
2612 | ERROR_IDX_EXCEPTION
2613 | ERROR_IDX_3RD_CACHE
2614 | ERROR_IDX_CANCEL
2615 | ERROR_IDX_STATUS
2616 | ERROR_IDX_INTEGRITY
2617 | ERROR_IDX_READ
2618 | ERROR_IDX_POOL_STATE
2619 | ERROR_IDX_POOL_SAVE
2620 | ERROR_IDX_HASH
2711 | ERROR_MGR_CONFIGURATION
2712 | ERROR_MGR_NOTIFICATION
2713 | ERROR_MGR_REQUEST_TYPE
2714 | ERROR_MGR_NO_LOG
2715 | ERROR_EMPTY_CLUSTER
2716 | ERROR_TEMPLATE_PARSING
2717 | ERROR_NODE_UNAVAILABLE
2718 | ERROR_NODE_NETWORK
2719 | ERROR_NODE_ERROR
2720 | ERROR_NODE_RESPONSE
2721 | ERROR_NO_RESULT
2722 | ERROR_MGR_SIM_WORDS
2723 | ERROR_MGR_SIM_RESPONSE
2724 | ERROR_MGR_SIM_DOCUMENT
2725 | ERROR_MGR_SIM_PARAMETER
2726 | ERROR_MGR_REINDEX
2811 | ERROR_PARSER_TIMEOUT
2812 | ERROR_ALTERNATIVES
2813 | ERROR_VOC_CLOSE
2814 | ERROR_VOC_INIT
2815 | ERROR_WILDCARD

 

Reporting Problems

Problems that occur when working with Clusterpoint Server can be reported to Clusterpoint technical support.

To report a problem, provide the following to technical support:

·         name of your company

·         Clusterpoint DBMS license number

·         version of the Clusterpoint DBMS software

·         relevant Clusterpoint Server log file items

For contact information on technical support, see Contact Information.

 

Appendix B: Frequently Asked Questions

This section contains the following frequently asked questions:

·         How can I import binary data to Clusterpoint Server?

·         How can I make Clusterpoint Server automatically ignore common words when performing FTS?

·         How can I see the actual query that is used for FTS?

·         How can I export the vocabulary with a Clusterpoint API command?

·         Why do I get an error: connection failed when importing data to the Clusterpoint Server from a Windows NT 4.0 or Windows 2000 environment?

·         Is it possible to return more than 1000 documents to the result set?

·         When I import a large amount of data to the Clusterpoint storage, why are they not available for FTS for a while?

 

How can I import binary data to Clusterpoint Server?

To import binary data such as MS Word or PDF document files to the Clusterpoint storage, the data must be entered in the info part of the document.

Note:       Data in the info part are not available for FTS. If you want your data to be available for FTS, they must be stored as plain text.

Usually, binary data do not comply with the XML formatting standard. However, to be imported to the Clusterpoint storage, they must comply with it. Therefore, before importing, you must encode the binary data using base64 or another XML-safe encoding.

For more information on document parts, see Understanding Clusterpoint Server Document Structure.
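The base64 step can be sketched in Java as follows, assuming Java 8 or later for java.util.Base64. The class name and the wrapping element name passed by the caller are hypothetical; the actual element used for the info part depends on your document structure and policy.

```java
import java.util.Base64;

// Hypothetical helper: makes binary content safe to embed in an XML document.
final class BinaryPayload {

    /** Encodes raw bytes as base64 text, which contains no XML special characters. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Restores the original bytes from base64 text read back from a document. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    /** Wraps the encoded bytes in an element; the tag name is chosen by the caller. */
    static String wrap(String tag, byte[] raw) {
        return "<" + tag + ">" + encode(raw) + "</" + tag + ">";
    }
}
```

The encoded text is roughly one third larger than the original binary data, which is worth considering when sizing requests.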

How can I make Clusterpoint Server automatically ignore common words when performing FTS?

Clusterpoint Server can be configured to ignore, at search time, the words that appear most often in the Clusterpoint storage by using a customer-supplied ignored words list. These common words are ignored during FTS for performance reasons.

The ignored words list can be edited.

For more information on ignored words, see Ignored Words.

How can I see the actual query that is used for FTS?

Often the actual query that is used for FTS differs from the search query you entered. The reasons can be the following:

·         If the original query contains words from the ignored words list, they are dropped from the actual search query.

·         If the original query contains wildcard patterns, a class of words created from the wildcard pattern usage is entered in the actual search query.

To see the actual query used for FTS, use the <real_query> tag of the XML reply to the search command.

For more information on the search command, see Search.
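Reading the tag from a reply can be sketched in Java as follows. The class name is hypothetical, and the sketch assumes that real_query appears without a namespace prefix somewhere inside the reply; its exact position in the reply is not assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts the query that was actually used for FTS.
final class RealQuery {

    /** Returns the content of the first real_query element, or null if the reply has none. */
    static String extract(String replyXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(replyXml)));
            NodeList hits = doc.getElementsByTagName("real_query");
            return hits.getLength() > 0 ? hits.item(0).getTextContent().trim() : null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Reply is not parseable XML", e);
        }
    }
}
```

Comparing the returned value with the query that was sent is a quick way to see which words were dropped or expanded.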

How can I export the vocabulary with a Clusterpoint API command?

The vocabulary is a list of all unique words in the Clusterpoint storage. Unique words are found in documents and added to the vocabulary while storing these documents to the Clusterpoint storage. Each Clusterpoint storage has its own vocabulary.

Unfortunately, it is not possible to export the vocabulary with any of the Clusterpoint API commands.  However, on the file level the vocabulary is stored in a text file, where each line contains one word. You can copy this text file and view it.

For information on the vocabulary text file, see the Clusterpoint Server User Guide.

For more information on vocabulary, see Understanding Storing Information in Clusterpoint Server.

Why do I get an error: connection failed when importing data to the Clusterpoint Server from a Windows NT 4.0 or Windows 2000 environment?

Importing data to the Clusterpoint Server, just like any other operation with the Clusterpoint Server, is performed by transporting XML requests and replies via HTTP.

When a large amount of data is imported to the Clusterpoint Server, many TCP/IP connections are opened. After the connections are closed, they remain in the TIME_WAIT state for a certain time period.

By default, in the Windows NT 4.0 or Windows 2000 environment, the connection limit is quite small and the TIME_WAIT period is long.

Because the number of new connections created per second can be very large and closed connections remain in the TIME_WAIT state for some time, the number of connections can reach the limit very quickly.

In that case, the system does not allow a new connection to be created and the error is returned.

To shorten the TIME_WAIT period, so that closed connections are released sooner, configure the following value in the Windows NT 4.0 or Windows 2000 registry:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]

"TcpTimedWaitDelay"=dword:00000015

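where the value is the TIME_WAIT period in seconds. Note that a dword value in a .reg file is written in hexadecimal, so dword:00000015 corresponds to 21 seconds.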
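The arithmetic behind this limit can be sketched in Java as follows. The class name is hypothetical and the numbers in the usage note are illustrative assumptions, not measured Windows defaults. The model is deliberately simple: every closed connection holds one slot for the whole TIME_WAIT period.

```java
// Hypothetical helper: estimates when closed connections in TIME_WAIT exhaust the connection limit.
final class TimeWaitBudget {

    /** Highest connection rate (per second) that can be sustained without reaching the limit. */
    static double maxSustainedRate(int connectionLimit, int timeWaitSeconds) {
        return (double) connectionLimit / timeWaitSeconds;
    }

    /** True when the given rate keeps more connections in TIME_WAIT than the limit allows. */
    static boolean exhausts(double connectionsPerSecond, int connectionLimit, int timeWaitSeconds) {
        return connectionsPerSecond * timeWaitSeconds > connectionLimit;
    }
}
```

For example, with an assumed limit of 4000 connections and a TIME_WAIT period of 240 seconds, an import that opens 50 connections per second reaches the limit, while the same import with a 30 second period does not.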

 

 

Is it possible to return more than 1000 documents to the result set?

By default, the limit of documents to be returned to the result set is 1000. It is possible to increase this limit. However, the following functionality is designed for a maximum of 1000 documents in the result set:

·         sorting search results by the relevance

·         grouping search results by a group

If you increase the limit of documents in the result set, the new limit applies to all functions, except that when search results are sorted by relevance or grouped, only 1000 documents are returned to the result set.

Increasing the limit also means that transactions in the Clusterpoint Server take longer. Therefore, you should also increase the timeout period of functions.

For more information on the relevance, see Relevance.

For more information on grouping documents by a group, see Search.

When I import a large amount of data to the Clusterpoint storage, why are they not available for FTS for a while?

When importing data to the Clusterpoint storage, if the memory reserved for the memory cache is not sufficient for the amount of data being imported, then:

1.       The data being imported are written to another cache, which is written to the disk, and the index state is expanding.

2.       When the import is complete, Clusterpoint Server commits the data written on the disk to the Clusterpoint Index, and the index state is collapsing.

While the index state is expanding or collapsing, the data written to the disk are not available for FTS. The data become available for FTS only after they are added to the Clusterpoint Index.

For example, if the amount of data to be imported is tens of GB, these data will not be available for FTS for a few hours.

For more information on the index state, see Status.

 

Appendix C: Storage Configuration File

 

This appendix describes the parameters that set default configuration options for each Clusterpoint storage.

In addition to configuring the Storage Document policy, which directly controls how document contents are handled during indexing and search, some additional configuration options can also be set for each storage.

These options are configured by using the Clusterpoint Manager or by editing, with any text editor, the XML configuration file ‘config.xml’, located in the directory that has the same name as the storage.

The default options are shown in the table below. Additional options may be available if required.

Configuration option | Full path | Possible values | Description
General configuration
Temporary directory | /config/tmpdir | A valid filesystem path (default is "/tmp") | Sets the temporary directory for Clusterpoint Server. Backups are expanded here during restore.
Backup directory | /config/backup_dir | A valid filesystem path (default is "/home/cps/backups") | Sets the directory where backup files will be placed.
Authorization check | /config/authorization | yes/no (default yes) | Enables/disables the authorization check when executing any request.
Debug information | /config/debug | yes/no (default no) | Turns on/off printing of debug information in log files.
Worker pool size | /config/workers | Positive integer (default 5) | Sets the default number of worker threads that serve the request queue.
Autostart flag | /config/bootable | yes/no (default no) | Enables/disables starting the storage on server start.
Storage description | /config/description | String | Storage description, shown in Clusterpoint Manager.
Repository configuration
Highlight open mark | /config/repository/highlight/open_mark | Any characters (default is "<b>") | Sets the character/character sequence that begins a highlighted word.
Highlight close mark | /config/repository/highlight/close_mark | Any characters (default is "</b>") | Sets the character/character sequence that ends a highlighted word.
Bwd_chars | /config/repository/snippet/bwd_chars | Positive integer (default is 150) | When creating a snippet, the snippet will begin with a word up to bwd_chars before the position in the text which corresponds best to the query.
P_words | /config/repository/snippet/p_words | Positive integer (default is 1) | The snippet will have no more than p_words + q_words*(number of words in the query) words corresponding to the query.
Q_words | /config/repository/snippet/q_words | Positive integer (default is 2) | See P_words.
Fwd_chars | /config/repository/snippet/fwd_chars | Positive integer (default is 250) | The snippet will end if the next word in the document that corresponds to the query is more than fwd_chars characters or fwd_punct punctuation marks distant from the previous word that corresponds to the query.
Fwd_punct | /config/repository/snippet/fwd_punct | Positive integer (default is 200) | See Fwd_chars.
Max_size | /config/repository/snippet/max_size | Positive integer (default is 2500) | The snippet will not be longer than max_size characters. If the snippet, according to the rest of the parameters, would be much shorter than bwd_chars, another snippet is created and appended to the previous one; such snippets are separated by "...".
Bucket size | /config/repository/bucket_size | Positive integer (default is 134217728 bytes) | Default size of a bucket in the bucket pool where data are stored.
Bucket saturation | /config/repository/bucket_compact_saturation | Positive double in range 0.0-1.0 (default is 0.5) | Threshold at which the compacting mechanism starts to defragment buckets.
Index configuration
Memory pool size | /config/index/memory_pool_size | Positive integer, in MB (default is 100) | Sets the size of the RAM area used for documents of each index (there might be several if stemming is used) that have been inserted but not yet saved to disk.
Specsymbols | /config/index/specsymbols | Character sequence (default '_') | Sets the list of characters treated as parts of words. By default all non-alphanumeric characters are treated as separators. For example, the email address "nobody@nowhere.com" in a text is split into 3 words: "nobody", "nowhere", "com". If the @ symbol is set as a specsymbol, the parser will split it into 2 parts: "nobody@nowhere" and "com". Example of multiple characters: <specsymbols>#@&</specsymbols>
Tag colocation distance | /config/index/colocation_distance | Positive integer (default is 10000) | Step for shifting word positions to make the colocation feature work. This parameter should be increased if the word count in colocated tags exceeds this value.
Policy in document | /config/index/parse_indoc_policy | yes/no (default no) | Enables/disables policy parsing for documents that contain a policy themselves.
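Reading these options from config.xml can be sketched in Java with the standard XPath API, using the full paths from the table. The class and method names are hypothetical, and falling back to the documented default when an element is missing is an assumption about how a client might treat the file, not a description of the server. The splitWords method only illustrates the word splitting described for specsymbols; it is not the server's parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical reader for a storage config.xml.
final class StorageConfig {
    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    StorageConfig(String configXml) {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("config.xml is not parseable XML", e);
        }
    }

    /** Returns the text at the given full path, or the fallback when it is missing or empty. */
    String get(String path, String fallback) {
        try {
            String value = xpath.evaluate(path, doc).trim();
            return value.isEmpty() ? fallback : value;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid path: " + path, e);
        }
    }

    /** Same as get, for integer options such as /config/workers. */
    int getInt(String path, int fallback) {
        return Integer.parseInt(get(path, Integer.toString(fallback)));
    }

    /** Splits text into words; letters, digits and the given specsymbols stay inside a word. */
    static List<String> splitWords(String text, String specsymbols) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || specsymbols.indexOf(c) >= 0) {
                current.append(c);
            } else if (current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

A caller would, for example, use getInt("/config/workers", 5) to obtain the worker pool size and pass get("/config/index/specsymbols", "_") to splitWords to preview how a text would be split.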

 

 

Glossary

A

API

Application programming interface.

B

Boolean expression

Expression containing logical operators like AND, OR, and NOT.

C

Clusterpoint API

Standardized set of functions for accessing the Clusterpoint Server.

See also: Clusterpoint Server.

Clusterpoint Server document

Unit in Clusterpoint storage against which searching is performed. It can be unstructured or XML structured.

See also: Clusterpoint storage.

Clusterpoint Server

Stand-alone server for storing and retrieving information such as plain texts or XML structured documents. It can be run in one or more instances per computer.

Clusterpoint storage

Data collection for storing Clusterpoint Server documents in a format that ensures a search is performed very fast. Clusterpoint storage is contained by one Clusterpoint Server instance, and consists of vocabulary, document repository, and Clusterpoint Index.

See also: Clusterpoint Server document, vocabulary, document repository, Clusterpoint Index.

case-support

Feature that allows distinguishing between uppercase and lowercase characters.

D

demon

Program or process, part of a larger program or process, that is dormant until a certain condition occurs and then is initiated to do its processing.

document repository

Place where all Clusterpoint Server documents are kept in the format in which they were stored in the Clusterpoint Server system, so that they can be returned on a search request.

See also: Clusterpoint Server document.

F

FTS

Full text search.

See also: FTS query.

FTS query

Full text search request to the Clusterpoint Server.

See also: FTS.

fuzzy search

Feature that allows searching for words that sound the same but are spelled differently.

I

Clusterpoint Index

List of words, where each word has a list of pointers to Clusterpoint Server documents in which the word occurs.

See also: Clusterpoint Server document.

M

markup search

Feature that allows searching for words within specific markup.

multi-language support

Feature that ensures the documents in different languages and character encodings can be stored and searched within one Clusterpoint storage.

See also: Clusterpoint storage.

P

policy

Set of operations, specific to parts of the document structure, for importing data to and retrieving data from the Clusterpoint storage: special indexing methods, relevance weight rules, listing rules in search results, result grouping rules, etc.

See also: Clusterpoint storage.

R

RAM

Random access memory.

rate

Mechanism, which ensures that results are ordered in a result set according to assigned unchanging rate.

relevance

Measure of the accuracy of the search results.

See also: rate.

S

snippet

Fragment with an occurrence of a search term.

specific weight

Weight of a word in a document that is calculated according to the specific weight interval of the document part in which the word occurs.

See also: specific weight interval.

specific weight interval

Integer range assigned to a document part that denotes the importance of the document part compared to other document parts.

See also: specific weight.

stemming

Feature that allows searching for words and their inflected forms.

U

UTF

UCS (universal character set) transformation format.

V

vocabulary

Vocabulary is a list of all unique words in the Clusterpoint storage. Unique words are found in documents and added to the vocabulary while storing these documents to the Clusterpoint storage.

See also: Clusterpoint storage.

W

weight

Relative integer from 0 to 100 that defines relevance, used to rank XML document parts against each other.

wildcard search

Feature that allows searching for unknown characters or phrases.

X

XML

Extensible markup language.

XML reply

XML message that is returned when submitting an XML request.

See also: XML request.

XML request

XML message that is sent to the Clusterpoint Server to perform a Clusterpoint API command.

See also: Clusterpoint API.

 


 

 

 



[1]               A demon is a program or process, part of a larger program or process, that is dormant until a certain condition occurs and then is initiated to do its processing.