Clusterpoint DBMS
version 2.0
© Clusterpoint Ltd. 2006-2011. All rights reserved.
This documentation is provided as part of the Clusterpoint Server and Clusterpoint DBMS systems.
The content of this document is provided for reference use only, is subject to change without notice, may contain technical inaccuracies or typographical errors, and should not be construed as a commitment of Clusterpoint Ltd.
Clusterpoint Ltd. may make any improvements or changes in the product described in this document at any time without notice.
Use of this documentation is subject to the following terms:
The content of this documentation may not be altered or edited in any way. Only conversion to other formats is allowed.
You may create a printed copy for your own personal use.
For all other uses, such as selling printed copies or using (parts of) the documentation in another publication, prior written agreement from Clusterpoint Ltd. is required.
All brand names and product names used in this document are trade names, service marks, trademarks or registered trademarks of their respective owners.
Email: support <-> @ <-> clusterpoint.com (please ignore markup added for spam filtering)
Website: www.clusterpoint.com
2. Understanding Clusterpoint Server
Document Structure
2.4.3.3. Results Grouping and Ordering according to Clusterpoint Information Ranking
4.5.2.1.1.13. Numeric Search in More Than One Tag
This preface is an introduction to the Clusterpoint Server (CPSE) Developer’s Guide. It defines the audience, describes the structure of this guide, and lists the typographic conventions and abbreviations used throughout the guide.
This guide applies to Clusterpoint Server version 2.0.
Clusterpoint Server (CPSE) is part of the Clusterpoint Database Management System (DBMS). It is database server engine software written in C/C++ that supports the Clusterpoint API. It works as transparent cluster database software (the same copy on all computers).
Clusterpoint Server operates as database server software on any commodity 32-bit or 64-bit computer hardware.
Clusterpoint API (application programming interface) is an XML-based interface protocol between all applications and the Clusterpoint Server software.
Clusterpoint Manager is an administration and configuration application, developed in PHP as a Web server module, that communicates with all Clusterpoint Server software instances installed across the cluster.
This
section contains the following topics:
· Audience
This guide is intended for application
developers using Clusterpoint Server as a corporate search technology platform
for building and operating customized applications based on Clusterpoint API.
This guide has the following structure:
Section | Description
| Describes Clusterpoint Server, its concepts, and architecture.
| Describes Clusterpoint Server document structure and presents scenarios for unstructured source data and XML structured source data.
| Describes multi-language support and text encoding concepts, and explains XML formatting concepts.
| Contains all Clusterpoint Server function descriptions and syntax in XML.
| Describes Clusterpoint Server clustering.
| Lists and describes sample applications that are based on Clusterpoint API commands.
| Contains a list of error messages.
| Contains a list of frequently asked questions and answers to them.
The Clusterpoint DBMS documentation includes the following guides:
Title | Description
Clusterpoint DBMS Developer's Guide | Describes how to develop custom database storage, search and indexing applications for the Clusterpoint Server core software included in the Clusterpoint Server package.
The following styles and conventions are used in this guide:
Convention | Description
Verdana | Represents command, function, file and directory names, system messages, and command-line commands.
Hyperlink | Represents a hyperlink. Clicking on this field takes you to the identified place.
Example | Represents an example.
Source code | Represents code.
Comment | Represents a comment in the code.
The following abbreviations are used in this guide.
Abbreviation | Description
Clusterpoint Server | Clusterpoint Server (hardware + crawler and search engine software)
Clusterpoint Server | Clusterpoint Server - core Clusterpoint DBMS server software installed on any hardware that is networked into a private cloud architecture to form a cluster, whose combined storage, RAM and CPU resources can be used by the Clusterpoint Server software when clustering functionality is used.
MANAGER | Clusterpoint Manager - Web tool for centralized administration, configuration and monitoring of all Clusterpoint Server systems: server instances in RAM servicing storages (databases), database storages, cluster storages, and underlying cluster equipment resources.
API | Application programming interface.
FTS | Full text search.
XML | Extensible markup language.
HTML | Hypertext markup language.
SQL | Structured query language.
UTF | UCS (universal character set) transformation format.
HTTP | Hypertext transfer protocol.
This guide introduces Clusterpoint Server from an application developer’s perspective and provides reference material for building customized applications based on Clusterpoint Server.
This section includes the following:
· What is Clusterpoint Server?
· Clusterpoint Server in Corporate Networks
· Understanding Clusterpoint Server Environment
· Concepts
· Clusterpoint Server Architecture
· Features
Clusterpoint Server is the core database storage and search engine of the Clusterpoint database management system (DBMS). Clusterpoint Server provides the database information storage, access, search and retrieval used in the Clusterpoint DBMS product line. Clusterpoint Manager is an application for administration of Clusterpoint databases, and is also included as part of Clusterpoint DBMS.
Other Clusterpoint products include Clusterpoint Crawler, Clusterpoint Searcher, Clusterpoint Manager, Clusterpoint Network Traffic Surveillance System and other vertical application sector technology solutions and products provided by Clusterpoint Ltd. Those products may include the full or partial Clusterpoint DBMS, or be integrated for cohesive use with Clusterpoint Server.
The Clusterpoint Server system consists of the Clusterpoint
Server and application programming interface (Clusterpoint API) for building
information storage and retrieval applications.
The Clusterpoint Server is an operational
platform that performs information storage, access, search and retrieval tasks
by executing a predefined set of commands.
Clusterpoint API is used for developing
applications that are specific and customized according to your company needs.
Note: This guide provides sample code for building software applications for Clusterpoint Server for the most common source data formats and retrieval scenarios. It is not possible to cover all scenarios, so this guide provides only the basic reference material for building your own applications.
Nowadays, the amount of data in companies is increasing very rapidly. Much of this data contains textual information, especially in web applications. It is either unstructured (texts, emails, documents) or semi-structured (text with some structural markup).
Another trend, driven by advances in hardware technology, is that databases are becoming more and more document-oriented, without splitting data among multiple tables and columns. Keeping all the relevant data in one place makes huge databases simpler and easier to manage. Clusterpoint Server is also a document-oriented database platform, and works on XML document collections that we simply call “storages”.
One of the key ways to effectively retrieve such data from document collections, and therefore make the data usable, is full text search (FTS). Full text search is probably one of the key competitive advantages of Clusterpoint database technology. It relies on a very powerful methodology implemented in Clusterpoint Server for information indexing and searching: Clusterpoint Information Ranking.
Full text search in Clusterpoint Server is based on an optimized mathematical model and our own original data ranking algorithms, which ensure very high performance when searching structured, unstructured or semi-structured information in large amounts of data, compared to traditional legacy SQL systems. For this purpose, all data in Clusterpoint Server are stored in a special type of index: Clusterpoint Index, which is a cohesive combination of Clusterpoint Index, a graph database index and relational database indexing models.
For more information on how data are stored in Clusterpoint
Server, see Understanding Storing Information in
Clusterpoint Server.
Subjects for full text search can be any
unstructured and XML structured data, for example, text collections, separate
phrases or words in text documents, Web pages, Web addresses, several special
markups for textual and numerical data, bookmarks of HTML or XML pages, domain
names, SQL database entry key IDs, file names, XML field or tag names, and so
on.
The following figure illustrates the Clusterpoint
Server system from a high level:

Figure 1: Clusterpoint Server operational diagram
In Figure 1, all data storage, manipulation and search are implemented by the Clusterpoint Server software running on the hardware; for example, it performs database search queries, update requests, status and control commands, and so on. This functionality is built into the Clusterpoint Server core server software and can be accessed through Clusterpoint API.
The Clusterpoint Server system can be integrated into an existing corporate network system. The Clusterpoint Server is incorporated into the network system just like any other server computer or appliance.
The following figure describes a sample corporate network with hardware on which the Clusterpoint Server software is installed:

Figure 2: Clusterpoint Server in a corporate network
Application servers and transaction processors from an existing corporate network can access the Clusterpoint DBMS core software, Clusterpoint Server, as active Clusterpoint API clients; in that case, for security reasons, end users cannot directly access the Clusterpoint Server. In this way, Clusterpoint Server can be used in any corporate network, independently of the operating system, database environment, or programming language used for application development. Normally, security partitioning is done at the application server level, and Clusterpoint Server provides only database storage and search functionality.
For large amounts of data, the Clusterpoint Server core software supports generic server clustering, which delivers performance scalability (for indexing), search scalability (for workload sharing among multiple identical copies of a database running on different hardware) and data volume scalability (if a database is split into N parts, or shards, each containing 1/Nth of the total database content). You can also combine those clustering options.
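The data volume option can be illustrated with a minimal sketch. It assumes, purely for illustration, that a document is assigned to one of N shards by hashing its document ID; this guide does not state how Clusterpoint Server actually places documents, and the function name shard_for is hypothetical.

```python
import zlib


def shard_for(document_id: str, shard_count: int) -> int:
    """Map a document ID to a shard number in the range 0..shard_count-1.

    Hypothetical placement rule (CRC32 modulo N), used here only to show
    how N shards can each hold roughly 1/Nth of the documents.
    """
    return zlib.crc32(document_id.encode("utf-8")) % shard_count


# Distribute 1000 made-up document IDs over 4 shards and count them.
counts = [0, 0, 0, 0]
for n in range(1000):
    counts[shard_for(f"doc-{n}", 4)] += 1
print(counts)
```

Because the rule depends only on the document ID, the same document always maps to the same shard, which is one simple way to keep each document in exactly one part of the database.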
For more information on Clusterpoint Server
multi-server architecture, see Multi-Server Architecture.
This section briefly describes Clusterpoint
Server software environment from developer's perspective.
This section contains the following topics:
· Overview
· Accessing Clusterpoint Server
The Clusterpoint Server is core software in the
Clusterpoint DBMS system that performs data storage, search and retrieval. There is a predefined set of
commands (Clusterpoint API) that are understood and executed by the Clusterpoint
Server.
The commands are implemented as Clusterpoint
API XML messaging requests and transported from/to Clusterpoint Server via HTTP
POST or GET messages.
Note: For security reasons applications can
access Clusterpoint Server through HTTPS protocol, if necessary. Installation of digital certificates on the
Web server needed for HTTPS support is not covered by this manual. Everything related to HTTP messaging works
also for HTTPS messaging.
Before API commands can be sent to the Clusterpoint Server, they must first be created. From the application developer's point of view, there are two basic methods of creating and executing Clusterpoint API commands.
Method 1 - constructing Clusterpoint API XML request messages directly and using the HTTP POST method to send the request to Clusterpoint Server. This can be done in any programming language that supports string operations and the HTTP protocol.
Method 2 - using the HTTP GET method to send Clusterpoint API commands included in a URL request as CGI parameters, where those CGI parameters follow the Clusterpoint API syntax and the supported Clusterpoint API command set.
In both cases your application has to communicate over the HTTP protocol with the Clusterpoint API gateway module called 'cpse-gw.cgi'. This module automatically processes GET or POST data depending on which method you used. This module is installed on every computer on which the Clusterpoint Server software is installed (on each cluster node).
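The two methods can be sketched in Python as follows. This is only an illustration under stated assumptions: the gateway URL, the XML element names and the CGI parameter names are placeholders and are not taken from the Clusterpoint API; only the module name 'cpse-gw.cgi' comes from this guide. The sketch builds the requests without sending them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical gateway location; adjust to your own Web server setup.
GATEWAY_URL = "http://localhost/cgi-bin/cpse-gw.cgi"


def build_post_request(xml_message: str) -> Request:
    """Method 1: wrap a ready-made XML request message in an HTTP POST."""
    return Request(
        GATEWAY_URL,
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def build_get_request(params: dict) -> Request:
    """Method 2: pass the command as CGI parameters in the URL."""
    return Request(GATEWAY_URL + "?" + urlencode(params), method="GET")


# Placeholder message and parameters; the real names are defined by the
# Clusterpoint API command reference, not by this sketch.
post_req = build_post_request("<request><command>search</command></request>")
get_req = build_get_request({"command": "search", "query": "cluster database"})

print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

Either request object could then be passed to urllib.request.urlopen to perform the actual call against a running gateway.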
The following diagram describes how
the Clusterpoint Server (CPSE) is accessed:

Figure 3: Accessing Clusterpoint Server
The following steps provide a general
description of how the Clusterpoint Server is accessed:
1. Users enter commands, such as search
queries, for the Clusterpoint Server from a user interface of an application,
for example, a Web search form.
2. A custom built application calls a Clusterpoint
API command with its parameters.
3. The Web server receives the HTTP request and passes it to the Clusterpoint Server Web server module (cpse-gw.cgi) using the Common Gateway Interface (CGI) of the Web server.
4. The Clusterpoint Server Web server module translates each user command into a Clusterpoint XML request and submits it to the Clusterpoint Server via UNIX domain sockets (internally).
5. The Clusterpoint Server responds to the application by returning Clusterpoint XML replies, which are optionally formatted using an XSLT stylesheet or a language of your choice (Java, PHP, Ruby on Rails, JavaScript etc.) and can then be displayed and viewed through the application user interface, for example, a Web page.
If you have large complex documents or custom databases that you want to turn into a searchable database, the POST method is usually used to send this information to Clusterpoint Server. You can create Clusterpoint XML messages on the application side, e.g., when indexing custom data into Clusterpoint Server. Using the POST-based Method 1 you can index very large documents. The GET method has length limits that depend on which web server you use and can be as low as 2 KB in some rare cases.
Note: The HTTP GET command has restrictions on the total length of all CGI parameters present in the URL. For XML data messages of 4 KB or larger, you should always use the HTTP POST method to communicate with Clusterpoint Server.
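The size rule from the note can be expressed as a small helper. This is a sketch, not part of the Clusterpoint API: the function name is hypothetical, and it measures the UTF-8 size of the XML message rather than the final URL length, which may be larger after URL encoding.

```python
# Threshold taken from the note above: 4 KB or larger should go via POST.
POST_THRESHOLD_BYTES = 4 * 1024


def choose_http_method(xml_message: str) -> str:
    """Return "POST" for messages of 4 KB or more, otherwise "GET"."""
    size = len(xml_message.encode("utf-8"))
    return "POST" if size >= POST_THRESHOLD_BYTES else "GET"


print(choose_http_method("<query>cluster</query>"))
print(choose_http_method("x" * 5000))
```

If your web server enforces a lower GET limit, such as the 2 KB case mentioned above, the threshold constant would need to be lowered accordingly.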
Method 2 is better for doing search queries and
performing document retrieval as simple HTTP GET commands. There is a server side mechanism that reads
HTTP GET parameters and composes Clusterpoint XML request messages from the
included parameters. You do not have to
worry about creating XML request messages in your application software, but
only have to pass right parameters in URLs, from which Clusterpoint XML request
messages are automatically created and internally sent to the Clusterpoint
Server for processing.
One can say that HTTP GET command invokes
something like a server-side pre-processor that creates the same Clusterpoint
XML messages understood by Clusterpoint Server core software as if you would
have used HTTP POST method.
Using the HTTP POST method also gives you slightly better performance if you have already built a database of Clusterpoint XML formatted documents and just want to send them for indexing. If you use a language such as Java, PHP or Perl for application development, using the HTTP GET command can be slightly faster for querying or indexing short data items, because the Clusterpoint XML message is then always constructed in a server-side module written in high-speed C. However, the performance differences are usually not significant, and our recommendation is to choose the method that best suits your application needs.
In a similar way, there is also a standard mechanism that formats XML reply messages received from the Clusterpoint Server, for example, by using an XSLT stylesheet. Again, it is your decision whether to handle XML reply messages on the application side, or to create an XSLT stylesheet with which results received from the Clusterpoint Server are automatically formatted and can be passed directly to end users.
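Handling a reply on the application side might look like the following sketch. The reply shown is invented for illustration: the element names reply, document, id and title are assumptions and do not describe the real Clusterpoint XML reply format.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply message; the element names are placeholders only.
SAMPLE_REPLY = """<reply>
  <document><id>doc-1</id><title>First article</title></document>
  <document><id>doc-2</id><title>Second article</title></document>
</reply>"""


def extract_results(reply_xml):
    """Return a list of (id, title) pairs, one per document in the reply."""
    root = ET.fromstring(reply_xml)
    return [
        (doc.findtext("id", default=""), doc.findtext("title", default=""))
        for doc in root.iter("document")
    ]


for doc_id, title in extract_results(SAMPLE_REPLY):
    print(doc_id, title)
```

The alternative described above, an XSLT stylesheet applied by the gateway module, moves this formatting step to the server so the application receives ready-to-display output.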
This section introduces and briefly explains Clusterpoint
DBMS software platform concepts that readers must be familiar with before going
into details.
This section contains the following topics:
· Clusterpoint Server FTS Capability
· Clusterpoint Server Web Server Module
· Clusterpoint Server Demons
· Clusterpoint Server Vocabulary
· Clusterpoint Server Document Repository
Clusterpoint Server is a stand-alone server for
storing and retrieving information such as plain texts or XML structured
documents. It can be run in one or more instances per computer.
For more information, see Multiple Storages Architecture.
Clusterpoint Server is designed to support
retrieving information stored using full text search queries with rich
enterprise search functionality.
Clusterpoint Server application programming
interface (API) is a standardized set of commands for accessing the Clusterpoint
Server.
Clusterpoint Server Web server module is a
module integrated with the built-in Web server that receives requests from an
application through the Web server via HTTP POST or GET and dispatches them to
the Clusterpoint Server via UNIX domain sockets.
The Clusterpoint Server Web server module also includes the functionality for composing XML request messages from HTTP GET or POST parameters and for optionally formatting the XML reply messages with a given XSLT stylesheet.
The Clusterpoint Server system is designed so
that the Clusterpoint Server module can be integrated with the Web server
through the Common Gateway Interface (CGI).
Optional Clusterpoint Server gateway module interfaces for Apache API or
FastCGI are available, which can further increase the performance of the whole
system.
Clusterpoint Server demons are internal UNIX
processes implemented as part of Clusterpoint Server platform core functionality
and used to store and retrieve documents.
In Clusterpoint Server Version 1.0 there were the following demons:
· a document or data handling demon 'cpse-dat'
· a demon to create and search Clusterpoint Index (index demon 'cpse-idx')
· a demon to build and use the vocabulary of unique words (vocabulary demon 'cpse-voc')
· a demon to manage overall system communications between all server demons on a single server and within the cluster topology (manager demon 'cpse-mgr').
In Clusterpoint Server Version 2.0, to improve performance and cut down interprocess communication, we have reduced the number of demons per storage to only two:
· a document storage, indexing and search demon, which also handles the vocabulary for that particular storage ('cps2-storage')
· a master demon to manage overall system communications between all demons on a single server and within the cluster topology (manager demon 'cps2-master').
All demons can be run as multiprocessor and
multithreading processes capable of effective utilisation of available hardware
resources.
A Clusterpoint Server document is the basic unit in the Clusterpoint storage (database) against which searching is performed. It can be unstructured (pure text) or XML structured; the latter can also combine both models, making a document semi-structured (for example, the full text of a newspaper article with meta structure such as author, date, source etc.).
A Clusterpoint storage is a named database (if you wish, a collection of XML documents as basic database data objects) that stores Clusterpoint Server documents in a format ensuring that each document is uniquely identified by a Clusterpoint document ID and that search can be performed very fast across all documents stored in that named storage.
Each Clusterpoint storage is serviced by one Clusterpoint Server software instance in RAM, and on disk consists of the Vocabulary, the Document repository, and the Clusterpoint Index.
Multiple different storages (databases) can be
run on a single hardware computer in parallel, serviced by separate Clusterpoint
Server instances, with their own users and access rights.
This virtualization architecture of
Clusterpoint database platform can be efficiently used for safe partitioning of
databases servicing different applications, all running on the same hardware
equipment and efficiently using shared CPU, RAM and disk resources.
You do not need hypervisors or other virtualization software to run parallel databases on the same hardware with Clusterpoint software: the Clusterpoint Server architecture provides this functionality out of the box.
Clusterpoint Server Vocabulary is a list of all unique “atomic” elements in a particular Clusterpoint storage: text items, strings, numbers, email addresses, dates, XML tags, URL or URI links, etc. We sometimes call those basic elements “words” for simplicity. Please note that the term “word” is not linguistic in the Clusterpoint architecture. It is a technical term describing any unique string found in your custom XML data objects; these strings are stored in the Clusterpoint Vocabulary much as the words of a spoken language are organized into a real vocabulary.
Clusterpoint Vocabulary is created by Clusterpoint Server during indexing: it splits all XML documents into “atomic” elements separated by delimiter characters, which can be custom configured for each named storage. These “atomic” elements are the “words”. They are added to the vocabulary while the documents are being stored in the Clusterpoint storage. Each Clusterpoint storage builds its own Vocabulary as new documents are added or updated. Each word in the vocabulary has an integer ID assigned to it.
Vocabulary of all unique words actually indexed
per storage is stored in RAM for better performance.
Each storage has its own Vocabulary, and can be
configured and fine tuned for indexing separately through Storage Configuration
file.
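The vocabulary idea described above can be sketched in a few lines of Python. This is an illustration only, not Clusterpoint's implementation: the class name, the default delimiter set and the ID numbering scheme are assumptions, and words are treated as case-sensitive for simplicity.

```python
# Hypothetical default delimiters; in Clusterpoint they are configured per storage.
DEFAULT_DELIMITERS = " \t\n<>/,;"


class Vocabulary:
    """Assigns an integer ID to every unique "word" (atomic element)."""

    def __init__(self, delimiters=DEFAULT_DELIMITERS):
        # Map every delimiter character to a space so str.split() does the rest.
        self._table = str.maketrans({ch: " " for ch in delimiters})
        self.word_ids = {}

    def add_document(self, text):
        """Split text into words, register new ones, return the word IDs in order."""
        ids = []
        for word in text.translate(self._table).split():
            if word not in self.word_ids:
                self.word_ids[word] = len(self.word_ids) + 1
            ids.append(self.word_ids[word])
        return ids


vocabulary = Vocabulary()
print(vocabulary.add_document("<title>Fast search</title> fast index"))
print(vocabulary.word_ids)
```

Note that the tag name "title" becomes a word like any other string, which matches the description of XML tags as atomic elements, and that a repeated word reuses the ID it was given the first time.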
The Clusterpoint Server Document repository is the place where all Clusterpoint Server documents are kept in the original XML format in which they were stored in the Clusterpoint Server system, so that the documents can be returned on search and data retrieval requests.
All documents must be uniquely identified by a Clusterpoint Document ID, which must be unique per named storage. Please note that the Document ID should be chosen by the customer so that it is a unique string value per named storage, possibly spanning multiple cluster nodes (on each cluster node, a storage with the same name is treated as part of the entire named storage, and therefore the Document ID must be unique across the whole cluster).
Each Clusterpoint storage has its own Document repository, which can reside on a single server or consist of multiple cluster nodes. A cluster storage (same-name storages on different servers, configured to work as a single logical cluster database in the Clusterpoint architecture) has a Document repository distributed over all cluster nodes forming that cluster storage, where the Document repository on a particular cluster node contains only the documents stored on that node.
Clusterpoint Index is a customized pre-sorted index of all basic “atomic” elements of XML documents: words, strings, email addresses, numbers, URLs etc. Each basic element has a list of pointers to the Clusterpoint Server documents in which the element occurs, its relations with other document parts, and, for numeric and date sorting, also traditional range indexes. Clusterpoint Index is a pre-sorted index ranked for fast and relevant search using the customizable Clusterpoint Information Ranking mechanism.
Clusterpoint Index ensures fast structured search and full text search (FTS) functionality, with the possibility of building different logical expressions when performing a database search.
Each Clusterpoint storage has its own unique Clusterpoint Index, which is always organized at the data storage level according to the customer's own information ranking rules and algorithms for the particular database.
Clusterpoint DBMS provides a programmable mechanism for custom ranking of your database content when creating the Clusterpoint Index. Information ranking provides ultra-fast, sub-second, Internet-style ad hoc search for FTS queries in a cluster and delivers the most relevant search results grouped and sorted up front, for example, on the first results page. Even when user queries consist of a few simple known search keywords, the query results are sorted for relevance by the customer's own information ranking rules using the pre-sorted Clusterpoint Index. Users performing any custom ad hoc database search get a nearly instant (sub-second) Clusterpoint database response with meaningful and relevant search results, grouped and ordered for the best search experience from the customer's point of view. This feature of Clusterpoint Index also enables linear search scalability in large cluster databases, where ranked ad hoc search can be performed without the performance loss characteristic of legacy SQL databases.
Please see section 2, which describes Clusterpoint Information Ranking in detail.
This section describes Clusterpoint Server from
various architectural perspectives.
This section contains the following topics:
· Client-Server Architecture
· Multiple Storages Architecture
· Understanding Storing Information in Clusterpoint Server
· Indexing Documents in Clusterpoint storage
· Querying Clusterpoint storage
The following figure describes Clusterpoint
Server architecture from the client — server perspective:

Figure 4: Clusterpoint DBMS client — server architecture
From the client — server perspective, the Clusterpoint
DBMS system consists of the client part and the server part.
On the client side, developers and administrators
work with an application server, customer web application or Clusterpoint
Manager utility, which initializes Clusterpoint API command calls and sends
them to the Clusterpoint Server via HTTP.
On the server side, the Clusterpoint Server
executes these Clusterpoint API commands accessing data in the Clusterpoint
storage and sends a reply back to the client side’s application server.
The following figure describes how multiple Clusterpoint Server instances can be run on a single computer with the Clusterpoint software installed (a cluster node):

Figure 5: Clusterpoint Server multiple storages architecture
Multiple instances of the Clusterpoint Server can be run on a single computer, each working with its own Clusterpoint storage. This Clusterpoint DBMS system architecture is intended for running customer database applications in a virtualized environment, where each server instance securely uses its own RAM and disk space, servicing only its own storage, with its own users and access security. No virtualization software is necessary: you can run as many different Clusterpoint storages in parallel on the same hardware as your hardware capacity allows. This also makes it possible to use all hardware capacity for different database applications. Using the Clusterpoint Manager web interface, you can securely create, run, view and manage particular storages only, or all of them, on any hardware servers installed in the cluster and managed centrally.
To ensure scalability of larger amounts of
data, the Clusterpoint Server can be clustered sharing a single Clusterpoint
storage across many computers.
The following figure describes how a single Clusterpoint
storage can be distributed on several Clusterpoint Servers:

Figure 6: Clusterpoint Server multi-server architecture
For more information on Clusterpoint Server
clustering, see Clusterpoint Server Clustering.
The following figure describes how data are
imported and stored in Clusterpoint Server:

Figure 7: Storing information in Clusterpoint Server
1. Data are entered by end users in
custom built applications.
2. Using Clusterpoint API commands data
are submitted to the Clusterpoint Server via HTTP.
3. From the submitted data, the Clusterpoint Server creates a Clusterpoint Index, a Vocabulary, and a Document repository (your original XML data objects), all of which are contained in the Clusterpoint storage.
For more information on the Clusterpoint storage, see Storing and Indexing Documents in Clusterpoint storage and Querying Clusterpoint storage.
Note: This
section contains some of Clusterpoint Server system implementation details.
Description provided in this section is very general and does not include
implementation details for all Clusterpoint Server functionality.
Note: The
knowledge provided in this section is not required for Clusterpoint Server
application developers. However, it can be found useful for a better
understanding of the Clusterpoint Server system.
The following figure describes general
principles how a document is stored and indexed in the Clusterpoint storage:

Figure 8: Indexing documents in Clusterpoint storage
1) When the Clusterpoint DBMS receives an XML request containing a document that must be imported into the Clusterpoint storage, the Clusterpoint DBMS Master demon[1] (cps2-master) authenticates the request and invokes the server instance that services the particular storage (cps2-storage).
2) The Master demon then sends the document to the Storage demon. The Storage demon parses the XML request and splits it into elementary “atomic” parts: words, strings, tags, numbers, URLs, emails etc.
3) The Storage demon stores the document in the Document repository and assigns a unique integer ID to each document.
4) The Storage demon also stores all “atomic” data from the document in the Vocabulary, which is located in RAM, then translates all “atoms” in the document to unique integer IDs and stores them in the Clusterpoint Index.
5) The Clusterpoint Index is constructed from the document ID and the IDs of all “atomic” elements contained in the Clusterpoint Vocabulary, applying the Clusterpoint Information ranking rules defined by the customer in the Document Policy configuration file (see below).
6) The resulting Clusterpoint Index links all “atomic” element IDs with the document ID and, as they are inserted, sorts the index data according to the particular customized Clusterpoint Information ranking sort order. This creates a set of pre-sorted indexes, highly optimized for sequential disk access, which are later used for ultra-fast, cluster-wide and relevant search.
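The indexing steps above can be sketched as a toy in-memory model. It is not the actual implementation: the class and attribute names are hypothetical, words are split on whitespace only, and the ranking rule is reduced to a single numeric rank per document (higher means more relevant), whereas real ranking rules come from the Document Policy configuration file.

```python
from bisect import insort
from collections import defaultdict


class Storage:
    """Toy model of a storage: repository, vocabulary and pre-sorted index."""

    def __init__(self):
        self.repository = {}            # internal document number -> original text
        self.vocabulary = {}            # word -> integer word ID
        self.index = defaultdict(list)  # word ID -> sorted list of (-rank, document number)

    def store(self, text, rank):
        """Store one document and index its words in ranked order."""
        doc_no = len(self.repository) + 1
        self.repository[doc_no] = text
        # dict.fromkeys removes repeated words while keeping their order.
        for word in dict.fromkeys(text.split()):
            word_id = self.vocabulary.setdefault(word, len(self.vocabulary) + 1)
            # Negated rank keeps the highest-ranked document first in the list.
            insort(self.index[word_id], (-rank, doc_no))
        return doc_no


storage = Storage()
storage.store("fast cluster search", rank=1)
storage.store("fast index", rank=5)
print(storage.index[storage.vocabulary["fast"]])
```

Keeping each list sorted at insert time is what makes the index "pre-sorted": a later search can read the most relevant documents from the front of the list without sorting at query time.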
Note: This
section contains some of Clusterpoint Server system implementation details.
Description provided in this section is very general and does not include
implementation details for all Clusterpoint Server functionality.
Note: The
knowledge provided in this section is not required for Clusterpoint Server
application developers. However, it can be found useful for a better
understanding of the Clusterpoint Server system.
The following figure describes general
principles how a query is processed in the Clusterpoint storage:

Figure 9: Querying Clusterpoint storage
1) When the Clusterpoint Server
receives an XML request containing a query, the Master demon (cps2-master) receives and authenticates
the XML request.
2) The Master demon then sends the
query to the Storage demon (cps2-storage).
3) The Storage demon translates all query terms to “atomic” element IDs by looking them up in the Clusterpoint Vocabulary, then searches the Clusterpoint Index and returns a list of the document IDs linked to the query term IDs, in a ranked, grouped and sorted order according to the particular Clusterpoint Index ranking rules for search relevancy.
4) The Storage demon uses the list
of document IDs to retrieve from a Document repository a result set containing
a document list that matches the query, and sends it to Master demon to return
it to the calling application.
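The query steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the data structures and their contents are made up, all query terms are required to match (AND logic), and ordering uses only the document rate instead of the full ranking rules.

```python
# Hypothetical, pre-built in-memory structures (all values are made up).
vocabulary = {"houses": 1, "sale": 2, "old": 3}        # term -> atom ID
index = {1: {10, 11}, 2: {10, 11, 12}, 3: {11}}        # atom ID -> document IDs
rates = {10: 3000, 11: 5000, 12: 100}                  # document ID -> rate
repository = {
    10: "Houses for sale",
    11: "Sale of old houses",
    12: "Garage sale",
}


def search(query):
    """Return (ID, document) pairs containing all query terms, highest rate first."""
    doc_ids = None
    for term in query.lower().split():
        atom_id = vocabulary.get(term)          # step 3: term -> atom ID
        postings = index.get(atom_id, set())    # look up the index
        doc_ids = postings if doc_ids is None else doc_ids & postings
    ranked = sorted(doc_ids or (), key=lambda d: rates[d], reverse=True)
    return [(d, repository[d]) for d in ranked]  # step 4: fetch the documents


results = search("houses sale")
```

A query term that is missing from the vocabulary yields an empty result list in this sketch.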
Clusterpoint Server is designed to comply with
the following standards:
| Standard | Reference |
| XML 1.0 | |
| UTF-8 | RFC 2279: UTF-8, a transformation format of ISO 10646 |
| HTTP | Hypertext Transfer Protocol |
| XPath 1.0 | |
The following table lists the Clusterpoint Server features and the section
of this guide in which each feature is described:
| Title | Section |
| FTS | |
| Relevance | |
| Multi-language support | |
| Case support | |
| Boolean expressions | |
| Stemming | |
| Wildcard search | |
| Fuzzy search | |
| Markup search | |
This
section describes Clusterpoint Server document structure concepts and explains
import strategies both for unstructured source data and for XML-structured
source data. It also describes document ordering and grouping according to
Clusterpoint Information Ranking, as well as language and text encoding
concepts.
This is probably
the most important section of the whole Developer’s Guide.
Please
read it very carefully. If you have difficulty grasping our concepts,
which, we know, differ radically from the relational database world, we
encourage you to read this section again.
If you
still have difficulty understanding any of the concepts, please do not hesitate
to contact us at support << @ >> clusterpoint.com.
This
section contains the following topics:
· Overview
· Creating Document Structure with Application
· Importing XML Data with Custom Structure
· Document Ordering according to Clusterpoint Information Ranking and Result Set
As mentioned previously, any data can be stored
in the Clusterpoint Server system and then retrieved using FTS queries (simple
Internet-style ad hoc queries using any query terms known to the user). Data is
stored in the Clusterpoint storage as customer-defined, schema-less XML documents.
A Clusterpoint Server document is the smallest unit in the Clusterpoint storage
against which searching is performed. When a search request is submitted, the Clusterpoint
Server searches within the Clusterpoint storage and finds all documents that
match the query.
Abstracting from specific content, format, and
structure, we assume that data in existing corporate filings, databases, or
storages can be perceived as documents that each have a unique document ID,
a title, and content consisting of textual (human-readable) and
possibly XML marked-up information. Database search, including rich
enterprise full text search (FTS) functionality, can be performed on such
data in the Clusterpoint architecture.
A document ID can be a simple integer, an
alphanumeric character string, a full file path and file name on a file server,
a URL of a Web page, or any other element that uniquely identifies a
document. It must be a unique string for
each document within a cluster storage (a named storage spanning multiple
servers; only one cluster node can contain the document with a particular ID).
Often there are also other elements; however,
we will talk about them later. Also we assume that when performing a search
request, what a user expects to have as a reply is a list of IDs, titles and
short descriptions of those documents, which match the search request.
The Clusterpoint Server system supports the
assumed default elements for importing and retrieving data. This saves a lot of
time when users perform ad hoc FTS without any predefined search forms
or pre-programmed API calls. User-friendly FTS
queries always return the assumed default data, so
no application-level programming is required.
As in relational databases, you can customize
the list of XML items returned for each search query. This is done through the
Document policy file, which is a very small XML configuration file. Using the
Document policy, customers can define, for each storage, the items to be listed
for simple search queries and many other default parameters that reduce
application-level programming for their database search
applications. More details about the Document policy configuration file
are given below.
The following sections describe how documents
are imported in the Clusterpoint Server system if they are not XML structured
and if they are XML structured.
When importing data to the Clusterpoint storage
using the Clusterpoint API functions, the default document elements (ID, title,
and content) are passed to the Clusterpoint Server as parameters of the
respective functions. When the respective Clusterpoint API command for storing a
document in the Clusterpoint storage is called, the elements of the document
are enclosed in XML tags and sent to the Clusterpoint storage.
The following figure illustrates this process:

Figure 10: Storing data in the Clusterpoint storage
The following table lists and describes all
default elements of a typical Clusterpoint Server document. The elements
specified between the tags <document> ... </document> serve as the Clusterpoint
Server document data definition, which tells the database engine how to index
and search documents of this structure.
| Element | Description |
| <id> | Unique document identifier in which FTS is not performed. |
| <title> | Document title in which FTS can be performed. |
| <rate> | Integer value, in which FTS is not performed, assigned to a document with respect to other documents. When performing a search request, search results are ordered by rate if they are not ordered by relevance. For more information on document ordering, see Document Ordering in Result Set. |
| <group> | Document group. This element can be used to denote the domain of a Web document or another grouping tag, and it can also be used as a classifier for any kind of documents. When performing a search, it is possible to limit the number of documents from one group in the search result. |
| <text> | Textual information in which FTS can be performed. Clusterpoint Server also supports XML marked-up information and preserves the markup when searching in it. A snippet, which is a fragment containing an occurrence of the search term, is returned in the search results. |
| <hidden> | Textual information in which FTS can be performed, but for which a snippet is not returned in the search results. |
| <info> | Additional information added to a document, in which FTS is not performed, for example, picture files, MS Word or PDF document files, and so on. Note that these files must be appropriately formatted. For information on appropriate formatting, see Formatting XML Special Characters. |
Extracting and defining these elements from source
data before importing the data to the Clusterpoint Server system is an
application task.
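As an illustration of this application task, the following Python sketch wraps extracted source data in the default elements. It only builds the XML text; the function name build_document and the sample values are hypothetical, and sending the document through the Clusterpoint API is not shown.

```python
import xml.etree.ElementTree as ET


def build_document(doc_id, title, text, rate=None, group=None):
    """Wrap source data in the default <document> elements."""
    document = ET.Element("document")
    ET.SubElement(document, "id").text = str(doc_id)
    ET.SubElement(document, "title").text = title
    if rate is not None:
        ET.SubElement(document, "rate").text = str(int(rate))
    if group is not None:
        ET.SubElement(document, "group").text = group
    ET.SubElement(document, "text").text = text
    # Special characters such as '&' are escaped by the serializer.
    return ET.tostring(document, encoding="unicode")


xml_doc = build_document(
    "http://www.example.com/houses.html",   # a URL used as the document ID
    "Houses & cottages",
    "Old houses for sale.",
    rate=5000,
    group="example.com",
)
```

Using an XML library rather than string concatenation keeps the special characters in the source data correctly escaped.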
Please note that the names of the XML data
elements <id>, <title>, <rate>, <text>, <hidden>
and <info> are purely our own and are used for the purposes of this guide only.
When you create a new storage using Clusterpoint Manager, a new default
Document policy configuration file is also created,
containing the default XML elements described above.
You are free to choose any other element names
for your own database XML documents, especially if you already have
a database with your own naming scheme.
For example, you can define your custom tag <url> as the document ID,
or instead of <rate> you can use your own <timestamp>,
<votes> or <date> tag for the same purpose.
The Document policy file enables you to re-assign the same functionality from
our default names to any other named tags. The default names are used only in
this guide, to help you understand how the Clusterpoint database
platform works.
If source data is XML structured, it is not
necessary to restructure it to the default Clusterpoint Server document
structure described in the previous section and Figure 10.
All XML tag names in custom user documents can
be named differently.
The Clusterpoint Server has a special configuration mechanism
that tells the database engine
what instructions to perform when processing specified XML fields, which are
called parts of the document in this manual.
Clusterpoint Server uses this custom document
structure definition mechanism, called Document policy (it is a small
configuration file located in the same
directory as your storage data), to define the location and behavior for
each document part.
A part of the document is any
opening XML tag and the closing XML tag of the same name, together with the
user content they enclose. Simply speaking, a document
part is an XML tag and all the content that it contains (between the opening
and closing tags).
Before you can store XML-structured source
data in the Clusterpoint storage, the Document policy configuration for the
Clusterpoint storage must be defined. The Clusterpoint Manager application
retrieves the existing policy and defines a new policy for each Clusterpoint
storage when storages are created and document policies are configured for them.
By Document policy we understand a complete set
of operations for data importing, search ranking and data retrieval, specified
as instructions that tell the Clusterpoint Server storage what to do with your
original XML data files during indexing and search.
The original XML documents stored in the
Clusterpoint database are not themselves changed or modified. The Document
policy configuration file only affects how Clusterpoint Server builds and uses
the database index, the Clusterpoint Index. The Document policy configuration
file contains the rules for indexing and defines the Clusterpoint Server
software behaviour during search and data retrieval, customizing it for your
needs. It also assigns default values to
reduce the programming code needed in application software.
Rules are defined for
each part (each XML tag) of the document; they specify which instructions to
apply to the document part during indexing or search. Each rule can have a
different value set for the particular document part. A single value set is
called a 'property'. For example, the property id=yes means that the
information in this document part is considered the document identifier, and the
property index=all means that the information in this document part
is indexed both as textual information and as textual
information with preserved XML markup, which enables later filtered search
within the content of this XML field only. The property index=text means that
the information is indexed only as text, saving disk space. There
are many property value sets in the Clusterpoint database architecture, and
we constantly expand them to meet our customers' new needs.
The following table lists all policy properties
with their values. The value marked “(default)” is the default value, in
other words, the value that is set if the property is not specified for the
document part.
| Property | Value | Description |
| id | no (default) | Information within this part will not be considered the identifier of the document. The policy is not applied to this document part. |
| | yes | Information within this part will be considered the identifier of the document (i.e. the primary key in legacy SQL terminology). An ID can be a simple integer, an alphanumeric character string, a full file path and file name on a file server, the URL of a Web page, or anything else that uniquely identifies a document. There must be exactly one ID for a document, and duplicate IDs may not exist within a named storage (cluster storage or single storage). |
| rate | no (default) | Information within this part will not be considered the rate of the document. |
| | yes | An integer number within this part will be considered the rate of the document. It defines the relative ranking of an XML document against other XML documents in the storage, including a cluster storage, and is used at search for sorting and grouping documents for output. It can be generated by any customer-defined algorithm or formula reflecting customer business needs and database-specific search requirements. Values in this part can be from 0 to 2^32-1. Values may or may not be unique: ranking is relative. |
| group | no (default) | Information within this part does not denote a grouping classifier of a document for which output grouping limits may be required. |
| | yes | Information within this part denotes a classifier for any kind of documents (e.g. a domain of a Web page), used to limit output per group. |
| index | no (default) | Information within this part will be stored in the document repository and available for retrieval; however, it will not be indexed in the Clusterpoint Index. |
| | text | Textual information contained within this part is added to the Clusterpoint Index and made available for FTS. |
| | xml | Textual information contained within this part is added to the Clusterpoint Index with its XML markup preserved. In this case FTS will be performed according to the XML markup. |
| | all | Both of the above (xml&text) apply to this document part. This consumes more memory and requires a longer indexing time. |
| | xml-text | Information within this part will be stored in the document repository and made available for retrieval by the textual content of all sub-level child tags. Useful if, for example, <address> is split into many subtags (<city>, <street>, etc.) and you want to search across all of them with OR logic at this sub-level. Permits grouping of the child tag values of an XML document part under a single search path when performing search within markup. This can replace multiple OR operations with a single fast default ad hoc query search. |
| | facet | This index type is used for categorizing documents in some type of hierarchy, for example a directory structure. The data can later be accessed using XPath expressions relative to this part. Only one part of a document can be set as the classifying index. See the chapter on XML drilldown (faceted search) for more information. |
| | xml-text&xml&text&facet | Switches on all modes of the index policy. An exact selection of the required indexing modes can be made by joining them with the symbol '&'. |
| alias | any valid XML element tag name that is “virtual” (not present in the customer XML document) | If an alias is defined for a document part, the index will also record the contents of this part as if they were located in a virtual XML element named after the alias. Multiple XML parts can have the same alias element, therefore creating an AND operation at search. Aliases can be used when performing search within markup. Information within this part will be added to the <virtual-tag-name> part, which does not exist in the original customer document but is created during indexing for consolidated search across different original tags at different levels of XML nesting. The values of all document parts with the same alias name are joined into a virtual text string for <virtual-tag-name> at the index level, with a blank space character as the delimiter. This is useful for consolidating data on particular subjects (persons, addresses etc.) at the index level for meaningful search, and for performing search queries only within those virtual tags. It does not require adding ‘technical’ tags such as <hidden>, described in the Clusterpoint Server Version 1.0 Developer’s Guide, to the customer data structure, and so avoids adding complexity to the customer XML document structure for technical reasons. For example, all database address elements (people, company, workplace etc.) can be combined into a single alias virtual tag to create an index, searchable through ad hoc search, for all addresses occurring within a database at different levels of nesting and for different data objects. |
| | tag1&tag2 | To set multiple alias tags, use the symbol '&' as the separator. Sometimes a single XML tag value needs to be present in several different alias tags (virtual XML tags that are not present in the data object but are created during indexing at the Clusterpoint Index level for rich enterprise search needs), for example to enable relevant ad hoc search (full text search) in only a few selected groups of database items, filtering out all other parts. A database may have multiple such alias groups, enabling flexible customization for various ad hoc search needs. Multiple aliases can be selected and used when performing search within markup. |
| weight | <min–max> | This policy works only together with the index policy values text, xml, or all. The range is from 1 to 100. All words contained in this part are explicitly set to be relevant to the corresponding search term when performing FTS, using textual ranking as in enterprise search engines. If both min and max values are defined as a relative interval, query terms matching text in this document part are additionally ranked for content matching relevancy (e.g. term matches that are closer together in the text are ranked higher within the specified min-max interval than terms that are at a greater distance from each other). If only a single weight number is set, min and max are both equal to it and textual ranking at search is switched off. A single weight value defines a fixed XML structure ranking for a document part relative to other parts. |
| list | no (default) | Information within this part will not be listed in the search results. |
| | yes | Information within this part will be listed in the search results. |
| | highlight | Information within this part will be listed in the search results, and the search terms within this part will be highlighted. |
| | snippet | In the search results, only a snippet from this part will be shown. The search terms within this part will be highlighted. |
| index-numbers | no | Numbers found in this part are not stored and indexed for numeric range search and sorting. Overrides the default Storage configuration parameter. |
| | yes | Numbers are indexed independently of text in this part for numeric range search and sorting. This part must contain a numeric value. Overrides the default Storage configuration; the number is treated as a float. A standard range-based index will be created for this part of the document, assuming float values, enabling interval querying ‘min..max’ and additional sorting of results at search in ascending, descending and geospatial order. |
| | int | Numbers are indexed independently of text in this part for numeric range search and sorting. This part must contain a numeric value. Overrides the default Storage configuration; the number is treated as an integer. A standard range-based index will be created for this part of the document, assuming integer values, enabling interval querying ‘min..max’ and additional sorting of results at search in ascending, descending and geospatial order. |
| index-dates | no (default) | Information within this part will not be additionally indexed as a standard date-timestamp-based range index. |
| | yes | Information within this part will be additionally indexed as a standard date-timestamp-based range index for date range search ‘from..to’. Dates are indexed independently of text in this part. If there is more than one date in this element, only the last date is indexed. The available formats are (please contact us for other formats): |
| exact-match | binary | The contents of the tag are indexed byte-by-byte for exact matching purposes. Information within this part is treated as is for search queries, including white space etc. Useful for searching for an exact value string, for example, content from the beginning of the document part. |
| | text | The contents of the tag are indexed as a set of words for exact matching purposes. Punctuation and other marks are ignored. Case-insensitive. Information within this part is treated as text in lowercase only, with leading and trailing white space trimmed. |
| | all | Combination of both (binary&text); an API option selects one or the other during a search query. |
| | none (default) | The tag will not be indexed for exact match. |
| stem-lang | en | Defines that English language stemming rules are applied to this part of the document during a search query, if stemming is requested through the API. Configurable for other languages through a language-rule configuration utility that can be adapted to any other language. |
| | fr, es, pt, it, ro, de, nl, sv, no, da, fi, hu, tr | Stems in French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian and Turkish. Requests for stemming in other languages are welcome. Please note that this language stemming mechanism works at the API level for any search terms enclosed in ‘$’, as in $term$, automatically expanding all search query terms with OR according to the language stemming rules. An alternative way to provide good enough stemming for languages with varying word endings is to use string template queries such as ‘stemmin*’, which work on a statistical coverage basis using the unique words actually indexed in the Vocabulary, instead of strict language-specific stemming rules. In most cases users can use this simpler and more understandable string template syntax for ad hoc search, without resorting to strict language stemming rules. |
| | none (default) | The contents of the tag will not be available for stemmed search. |
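Several property values in the table above are combined with the symbol '&'. The following Python sketch shows one way an application-side tool could split such a property into its name and values before writing a policy file. The helper parse_property and the INDEX_MODES set are hypothetical and are not part of the Clusterpoint software.

```python
# Index modes taken from the table above (used here only for a sanity check).
INDEX_MODES = {"no", "text", "xml", "all", "xml-text", "facet"}


def parse_property(prop):
    """Split 'name=value1&value2' into (name, [value1, value2])."""
    name, _, value = prop.partition("=")
    values = [v.strip() for v in value.split("&") if v.strip()]
    return name.strip(), values


name, modes = parse_property("index=xml-text&xml&text&facet")
unknown = [m for m in modes if m not in INDEX_MODES]   # empty when all are valid
alias_name, aliases = parse_property("alias=tag1&tag2")
```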
The example of a storage policy configuration
below illustrates how a policy works.
There is a set of 'rules' for each part of the document, where each part
is defined in XPath notation in the policy file. Each rule specifies one or
more 'property' tags for the respective part of the document with values
described in the above table.
In the example below, a custom
storage policy configuration is defined for a document with the XML parts
'id', 'title', 'rate', 'group' and 'text' (a typical document for Internet data
storage).
<policy>
  <rule>
    <xpath>//document</xpath>
    <property>document=yes</property>
  </rule>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/title</xpath>
    <property>index=text</property>
    <property>weight=90-100</property>
    <property>list=highlight</property>
  </rule>
  <rule>
    <xpath>//document/rate</xpath>
    <property>rate=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/group</xpath>
    <property>list=yes</property>
    <property>group=yes</property>
  </rule>
  <rule>
    <xpath>//document/text</xpath>
    <property>index=text</property>
    <property>weight=10-89</property>
    <property>list=snippet</property>
  </rule>
</policy>
The policy defines how the search engine
responds when indexing or searching a document storage that contains data with
the document parts for which rules are defined.
The policy rules define which part serves as the unique document
identifier (id=yes), which parts are listed in
every search result (list=yes),
and which parts are indexed as text (index=text).
In the example above, the policy property for the
document part 'text' (list=snippet) also defines that a text snippet
is generated for this part instead of the actual content. The policy for the
part 'group' (group=yes) defines that results with the
same value are grouped and only
the first few results are shown in the search results list. The policy rules
also define that the relevance of all words found in the document part 'title'
is higher (weight=90-100) than that of any such words found
in the document part 'text' (weight=10-89).
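To make the rule and property structure concrete, the following Python sketch reads an abbreviated copy of the policy above and answers two questions: which part is the document identifier, and which parts are listed in search results. It is a reading aid only. The helper load_rules is hypothetical, and Clusterpoint Server processes the policy file itself.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example policy (three of the six rules).
POLICY = """
<policy>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/title</xpath>
    <property>index=text</property>
    <property>weight=90-100</property>
    <property>list=highlight</property>
  </rule>
  <rule>
    <xpath>//document/text</xpath>
    <property>index=text</property>
    <property>weight=10-89</property>
    <property>list=snippet</property>
  </rule>
</policy>
"""


def load_rules(policy_xml):
    """Return {xpath: {property name: value}} for every <rule> element."""
    rules = {}
    for rule in ET.fromstring(policy_xml).findall("rule"):
        xpath = rule.findtext("xpath")
        # Each <property> holds one 'name=value' pair.
        props = dict(p.text.split("=", 1) for p in rule.findall("property"))
        rules[xpath] = props
    return rules


rules = load_rules(POLICY)
id_parts = [x for x, p in rules.items() if p.get("id") == "yes"]
listed = [x for x, p in rules.items() if p.get("list", "no") != "no"]
```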
The example above shows a very simple policy
definition. An actual configuration for an
Internet document storage can have very sophisticated and advanced policy
configurations with tens of rules and hundreds of properties, depending on the
end user needs and application specifics.
The (policy, rules, property values) schema is a
very simple and efficient way to flexibly accommodate all kinds of
application needs for custom indexing and relevancy requirements.
You can build storages that accommodate special
search needs for a combination of text, XML and numeric data search; you can
design multi-zone relevance definitions for documents to improve the quality
of search results for your end users; you can define documents with a special
structure that have parts not to be indexed; you can manage the size of the
index and its performance by fine-tuning the indexed and non-indexed parts;
and so on.
Clusterpoint adds new Document policy property
values for policy definitions along with the new features supported by each new
version of the database server core software.
Please
check for any new policy property values when new versions of Clusterpoint Server
software are downloaded or installed.
This section describes how documents are
ordered in a result set. It describes the two mechanisms conceptually and
contains the following topics:
· Overview
· Rate (document ranking among themselves)
· Relevance (XML data structure ranking)
· Textual ranking (full text search query ranking, a subset of Relevance ranking)
There are three basic information ranking mechanisms
in the Clusterpoint system that determine how documents are ordered in a
search result set:
· By the Document <rate> value, which must be assigned by the application to
each document before storing it in the Clusterpoint storage and is independent
of the search request; this can be any customizable value or algorithm that
assigns a value for ranking documents among themselves if all other rankings
are equal;
· By the customer-assigned fixed relevance of specific XML structure parts,
which is calculated when performing a search and ensures that documents
matching a search request in a higher-ranked part of the XML structure are
displayed and grouped first in a result set; it ranks any XML structure
according to the customer's own needs;
· By the relevance of textual content to a particular search query, using
advanced enterprise search (full text search) content matching rules to find
the documents that match the search query more closely, depending on the
frequency and position of search terms within the textual content of the
particular XML part.
The Document rate ensures high performance of
the search function, the XML-structure relevance ensures the quality of FTS in
structured search results (in the XML structure), and textual relevance ensures
the quality of textual content matching.
The XML-structure relevance (fixed
relevancy) and textual relevance (floating interval of relevancies) are designed
in such a way that together they produce one overall search relevancy score,
so that results can be ordered at a very high quality level for the
semi-structured data most common on the Internet and the Web today.
Sorting by overall relevancy score is a default
mechanism that is applied every time a simple Internet-style ad hoc search is
performed.
Sorting only by document rate is an option that
you can choose additionally when a search is performed.
The decrease of performance due to the sorting
by rate is minimal.
The rate and relevance mechanisms are
illustrated by an example in the following figure:

Figure 11: Document rate and relevance
The query contains a search function that
must return documents containing the word 'houses'.
Each document in the Clusterpoint storage has
a Document rate assigned: document A has rate=5000, and document B has rate=3000.
The following table presents the sequence of
the two documents in the result set, when the search function uses the
relevance for document ordering, and when it does not, in other words, when the
relevance is on and when the relevance is off.
| Document and its rate | Relevance off | Relevance on |
| Document A rate=5000 | 1 | 2 |
| Document B rate=3000 | 2 | 1 |
When searching with the relevance off, only the
Document rate is considered, and, therefore, documents with higher rates are
displayed first. In the example, the document A has a higher rate than the
document B, and, therefore, the document A is displayed first.
When searching with the relevance on, the place
where the search term appears in the document is considered, and, therefore,
documents that contain the search term in parts that are more important than
other parts, in other words, parts that have a higher specific weight, are
displayed first. In the example, document A contains the search term in its
text part, whereas document B contains the search term in the document title,
which has a higher specific weight than the text. Therefore, document B
is displayed first.
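The two orderings in this example can be reproduced with a short Python sketch. The part weights (title 100, text 50) are illustrative assumptions chosen only so that a title hit outranks a text hit, and the function name order is hypothetical.

```python
# Illustrative part weights (assumptions): a title hit outranks a text hit.
PART_WEIGHT = {"title": 100, "text": 50}

documents = [
    {"name": "A", "rate": 5000, "hit_part": "text"},
    {"name": "B", "rate": 3000, "hit_part": "title"},
]


def order(docs, relevance_on):
    """Sort by (relevance, rate) when relevance is on, otherwise by rate only."""
    if relevance_on:
        key = lambda d: (PART_WEIGHT[d["hit_part"]], d["rate"])
    else:
        key = lambda d: d["rate"]
    return [d["name"] for d in sorted(docs, key=key, reverse=True)]


relevance_off = order(documents, relevance_on=False)   # ['A', 'B']
relevance_on = order(documents, relevance_on=True)     # ['B', 'A']
```

With the relevance on, the rate still acts as the tie-breaker between documents whose relevance is equal.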
The rate is an integer number in the
range from 0 to 4294967295 (2^32-1), which must be assigned by the
application to each document when storing it to the Clusterpoint storage.
The rate allows significant optimizations for
large data amounts, which ensures high performance of the Clusterpoint Server
system in massively clustered systems.
It is an application developer’s task to create
an effective document ranking algorithm for assigning document rate to document
collections that is appropriate and satisfies user needs and business rules,
for example, alphabetic order, by document publication or creation date, or objective
document importance, or customer project importance etc.
If the Document rate is not assigned or if
there are documents with the same Document rate, the default document order in
a result set is a reverse of the document storing sequence to the Clusterpoint
storage.
In a named Clusterpoint storage, multiple
Document rate parts can be assigned, and multiple assigning algorithms can be
used. At the API level you can specify which
rate must be used for ordering. XML
parts used for additional (second, third etc.) ordering must be indexed with the
Document policy properties ‘index-numbers’ or ‘index-dates’, and specifically
requested for this ordering during search through the API.
If your application requires several ordering
types for the same document collection, then you must implement a separate
rate-assigning algorithm for each rate tag.
Technically, assigning the Document rate to
documents is setting an integer value for the <rate> element. For more information on
the Clusterpoint Server document structure, see Creating Document Structure with Application.
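As one possible rate-assigning algorithm, the following Python sketch derives the rate from the publication date so that newer documents receive a higher rate. The function name rate_from_date and the choice of seconds since 1970 are assumptions made for illustration; any formula that produces an integer within the allowed range can be used instead.

```python
from datetime import datetime, timezone

MAX_RATE = 2**32 - 1   # upper bound of the rate range stated in this section


def rate_from_date(published):
    """Map a publication date to a rate so that newer documents rank higher."""
    # Seconds since 1970-01-01 UTC, clamped to the allowed rate range.
    seconds = int(published.replace(tzinfo=timezone.utc).timestamp())
    return max(0, min(MAX_RATE, seconds))


older = rate_from_date(datetime(2010, 5, 1))
newer = rate_from_date(datetime(2011, 5, 1))
```

The returned value is what the application would place in the <rate> element of the document.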
The relevance is an integer number
that measures the accuracy of the search results. It is calculated
according to:
1) the fixed relevancy weight values that you define for the parts of your
custom documents (you rank your XML data structure, telling the Clusterpoint
Server which parts are more important than other parts when search terms hit
those parts);
2) the relevancy weight interval (min..max) that you assign to those textual
parts of your documents where you want to apply classic enterprise search
rules for better textual content matching to search queries. Within this
interval, the relevancy depends on the search query matches and the actual
document content: it is calculated based on the number of times the search term
appears compared to other documents, the distance between the search terms in
the document if multiple words are being searched, and the position of the
search terms in the document.
Please note that fixed relevancy weights and
relevancy weight intervals provide ranking relative to each other, and can be
freely combined according to your own specific search needs. A document part
with a higher specific weight interval than other document parts with fixed
ranking is considered more important than the other parts.
For example, the document title is more important than the document text if
either of the two described relevance ranking methods produces a higher
overall relevancy score for the title.
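The relationship between a fixed weight and a weight interval can be illustrated with the following Python sketch. It is not the Clusterpoint relevance algorithm: the linear mapping of a textual score between 0 and 1 onto the interval is an assumption made for illustration, and the names parse_weight and part_relevance are hypothetical.

```python
def parse_weight(value):
    """Parse 'min-max' or a single number into a (min, max) tuple."""
    low, _, high = value.partition("-")
    low = int(low)
    high = int(high) if high else low      # a single number means min == max
    if not 1 <= low <= high <= 100:
        raise ValueError("weight must be within 1..100 and min <= max")
    return low, high


def part_relevance(weight, text_score):
    """Place a textual score in [0, 1] inside the part's weight interval."""
    low, high = parse_weight(weight)
    # Assumed linear mapping; a fixed weight ignores the textual score.
    return low + round((high - low) * text_score)


title_hit = part_relevance("90-100", 0.5)   # interval: textual ranking applies
text_hit = part_relevance("10-89", 1.0)     # best possible match in 'text'
fixed_hit = part_relevance("50", 0.9)       # fixed weight: always 50
```

In this sketch even the best match in the 'text' part (89) stays below any match in the 'title' part (90 or more), which is the effect of choosing non-overlapping intervals.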
In the Clusterpoint Server system, there is a common
overall relevance calculation algorithm, which is implemented according to the two
basic relevance ranking mechanisms described above in this section.
The specific weight interval can be customized
to best reflect the textual parts of your document structure and to give your
users the search experience they expect. The
fixed weights can be used for very exact grouping and positioning of search
results, for example, in paid search advertisement applications, where exact
positioning is of paramount importance. With fixed relevancies you can group your
search result set into up to 100 different relevance groups, and the search
results within each group can be additionally sorted by Document rate.
For more information on the Clusterpoint
Server relevance calculation algorithm, see Relevance Calculation Algorithm.
For more information on setting your own
specific weight, see Customizing Specific Weight Interval.
This section describes general principles of
the Clusterpoint Server overall relevance calculation algorithm.
Note: This section contains some Clusterpoint
Server system implementation details. The description provided in this section is
very general and does not include implementation details for all Clusterpoint
Server functionality.
The Clusterpoint Server overall relevance
calculation algorithm consists of two parts that are performed when:
·
storing
and indexing documents to the Clusterpoint storage
·
searching
documents in the Clusterpoint storage
Steps of the Clusterpoint Server overall relevance
calculation algorithm are described generally. To ensure a better understanding
of the algorithm, an example is also provided. Each step is followed by the
example part that reflects the step.
1. When storing documents to the Clusterpoint
storage, the specific weight for each word (by ‘word’ we mean all “atomic” elements
such as text words, strings, email addresses, etc.) in a document is calculated
as follows:
1.1 In each document part, the specific weight is calculated for each word according to the fixed relevancy weight or relevancy weight interval (min-max) of the document part in which the word occurs (assuming the content is textual).
The specific weight for a word in a document
part is the minimum value of the following:
· minimum value of the specific weight interval of the document part plus the number of times the word occurs in the document part
· maximum value of the specific weight interval of the document part
Note: The
specific weight interval minimum and maximum can be the same fixed value. In
that case, for all words in such document part, no matter how often they
appear, the specific weight in the document part is the same: the specific
weight value of the document part. In essence, the ranking
then becomes simple XML structure ranking with fixed
values.
Example:
A document
consists of three document parts: heading, description, and note. Each document
part contains words w1, w2, and w3 and has its own specific weight interval, as
described in the following figure:

Figure 12: Calculating specific weight for each document
w1(heading) = min(80, 80) = 80, w1(description) = min(20+1, 50) = 21, w1(note) = min(10+4, 12) = 12
w2(heading) = 0, w2(description) = min(20+3, 50) = 23, w2(note) = min(10+1, 12) = 11
w3(heading) = 0, w3(description) = min(20+1, 50) = 21, w3(note) = min(10+2, 12) = 12
1.2 The maximum value of specific weights of a word in all document parts is assigned as the specific weight of the word in the document.
Example (continued):
max(w1(heading),
w1(description), w1(note))=80
max(w2(heading),
w2(description), w2(note))=23
max(w3(heading),
w3(description), w3(note))=21
2. When searching documents in the Clusterpoint
storage, the relevance of the document according to the search request is
calculated as follows:
2.1 Specific weights of all search terms in a document are summed.
Example (continued):
Σ(w1, w2, w3) = max(w1(heading), w1(description),
w1(note)) + max(w2(heading), w2(description),
w2(note)) + max(w3(heading), w3(description),
w3(note)) = 124
2.2 The relevance is calculated by multiplying the sum from the previous step by a value d that is calculated taking into account the distance between the search terms in the document: the greater the distance, the smaller the value.
Example (continued):
Relevance =
Σ(w1, w2, w3) * d
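The steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the example, not the server implementation: the names are hypothetical, the heading occurrence count for w1 is assumed (the figure is not reproduced here), and the distance factor d is passed in as a plain number because the guide does not define how it is computed.

```python
from typing import Dict, Iterable, Tuple

# Specific weight intervals (min, max) per document part, taken from the example.
INTERVALS: Dict[str, Tuple[int, int]] = {
    "heading": (80, 80),
    "description": (20, 50),
    "note": (10, 12),
}

# Occurrence counts per part. The description and note counts follow the
# worked example; the heading count for w1 is an assumption (any count >= 1
# gives the same result because the heading interval is fixed at 80).
COUNTS: Dict[str, Dict[str, int]] = {
    "heading": {"w1": 1},
    "description": {"w1": 1, "w2": 3, "w3": 1},
    "note": {"w1": 4, "w2": 1, "w3": 2},
}


def part_weight(part: str, word: str) -> int:
    """Step 1.1: min(interval minimum + occurrences, interval maximum)."""
    occurrences = COUNTS[part].get(word, 0)
    if occurrences == 0:
        return 0  # the word does not occur in this part
    low, high = INTERVALS[part]
    return min(low + occurrences, high)


def word_weight(word: str) -> int:
    """Step 1.2: maximum of the word's weights over all document parts."""
    return max(part_weight(part, word) for part in INTERVALS)


def relevance(terms: Iterable[str], distance_factor: float) -> float:
    """Steps 2.1 and 2.2: sum of the term weights times the distance factor d."""
    return sum(word_weight(term) for term in terms) * distance_factor


weights = {word: word_weight(word) for word in ("w1", "w2", "w3")}
total = sum(weights.values())
print(weights, total)  # {'w1': 80, 'w2': 23, 'w3': 21} 124
```

With a distance factor of 1.0 the relevance equals the sum from step 2.1; any smaller factor lowers it proportionally.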
This section describes how to set specific
weights or weight intervals for document parts that best reflect your XML document
structure (XML structure ranking).
As described in the previous section, a
specific weight interval for a document part is an interval
between two integers; a fixed specific weight is the special case in which both ends of the interval are equal.
By default, the following specific weight intervals
are defined:
| Document part | Minimum | Maximum |
| Title | 100 | 100 |
| All except Title | 1 | 99 |
For the Title part, a fixed relevancy weight is
assigned: the Min and Max values are both 100.
For the other parts, a textual ranking
relevancy weight interval of 1-99 is assigned, producing an overall relevancy
within this range during search.
You can set a different value for the title
part, and you can define a separate specific weight interval for each document
part, such as Text and Hidden, or other document parts that you have, to ensure a more detailed
relevance calculation.
For performance reasons,
the maximum specific weight value is limited to 100.
You may think about relevance ranking as the
relative importance, in percent, of your custom XML document parts, assigning a higher percentage
to those parts where you want search hits to order your results higher. In the above example, search hits in Title
will have 100% relevance and will always be grouped first, whereas text relevancy will
be calculated according to interval-based ranking of textual content and will
be positioned lower than matches in the Title.
In this way you can efficiently group and order
search results into up to 100 groups with unique relevancies, mixing and matching
relevancies defined by fixed XML structure ranking and relevancies defined by
textual content ranking.
Whenever the overall relevancy calculated at search
is equal (for example, there are millions of matching documents containing a search
term in Title, all having an identical overall relevancy of 100% as in our example),
the results within the same relevancy group are additionally ordered according to
Document rate. This is the foundation of the
Clusterpoint customizable information ranking system.
This makes it possible to uniquely rank more than 400
billion data items in your database for exact positioning and ordering by the
best relevance from the user's point of view, even
if your FTS search query generates thousands or millions of results. For example, for a worldwide Internet index
if you search simply for ‘
Therefore the Clusterpoint Server software
additionally sorts results by Document rate value within each
group of results that share the same relevancy value. If your Document
rate-assigning algorithm, for example, calculates the number of pages linking text
‘
You can design and develop your own
customizable Clusterpoint Information ranking algorithms, which can be as easy
or as complex as your business rules require.
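The two-level ordering described in this section (relevance first, Document rate within equal relevance) can be illustrated with a small Python sketch. The Hit type and its field names are hypothetical, and ordering higher rates first is an assumption of this sketch rather than something the guide states.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    doc_id: str
    relevance: int  # overall relevance, 1..100
    rate: int       # value of the <rate> element


def order_hits(hits: List[Hit]) -> List[Hit]:
    # Highest relevance first; within equal relevance, highest rate first.
    # Descending rate order is an assumption of this sketch.
    return sorted(hits, key=lambda hit: (-hit.relevance, -hit.rate))


hits = [Hit("a", 100, 5), Hit("b", 57, 900), Hit("c", 100, 42), Hit("d", 57, 3)]
ordered = [hit.doc_id for hit in order_hits(hits)]
print(ordered)  # ['c', 'a', 'b', 'd']
```

Document b has the highest rate overall but still sorts below both documents in the relevance 100 group, which is the grouping behavior the section describes.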
To customize specific relevancy weights or
weight intervals for XML document parts (XML data structure ranking with
additional enterprise search ranking for textual content), define
Document policy rules that assign a fixed weight to those parts. Assign a relevancy weight interval
only to the textual content that you would like to rank additionally using
standard enterprise search engine methods.
Both types of information ranking rules can be assigned through Document
Policy. Add your own flexible Document
rate calculation algorithm, and you can bring the most
relevant data in your database to the first web page of results.
This mechanism of Clusterpoint Information Ranking
does not slow down database search, as the custom ranking is built into the
pre-sorted Clusterpoint Index during the indexing and data storage phase. There is no need to do massive sorting of
information at search. It has
already been organized for mostly sequential disk access, according to your custom
ranking rules, and search response times are always sub-second.
As soon as you store new documents in
Clusterpoint Storage, or update existing ones, the Clusterpoint Index is updated, applying
information ranking to all index elements.
In other words, you can organize any custom Clusterpoint Index for fast
and relevant search, based on your own information ranking algorithms and
rules, instead of relying on someone else to organize your database
information. This indexing model is
also massively scalable in a cluster (from a single server to thousands of servers,
if necessary) and delivers sub-second search results from a large cluster
with almost no performance loss.
The existing Document policy is retrieved and a
new Document policy is defined for each Clusterpoint storage by Clusterpoint Manager
Application, when creating storages and configuring Document policy files for
them. For more information please see Clusterpoint
Server User Guide, Creating and Configuring Storages.
To customize Document rate values, you have to
develop your own algorithm that assigns an integer value to the <rate> tag, or
select an existing XML tag to be used by the document ranking algorithm.
Combining all three Clusterpoint database
information ranking methods described above (for XML structure, for textual
content, and for documents), you can achieve the maximum search performance and
the maximum satisfaction for users searching your database.
They can enjoy ultra-fast and
user-friendly Internet-style ad hoc (FTS) search that delivers the most
relevant data for web applications almost instantly, independently of the total database size and
without performance loss in a clustered hardware setup.
There are also other benefits of using
Clusterpoint database information ranking to improve your productivity.
In the Clusterpoint architecture, you can configure your database for the
best search relevancy rules without any application software programming.
You can also eliminate integration with
enterprise search tools in your application software, which often means trying to achieve the desired
relevance by ranking search queries in application software, or building and
maintaining complex integrated SQL and related external full text search indexes
in application software. Many legacy enterprise
search systems struggle to work at high speed in clusters because they do not scale out well.
Clusterpoint Information Ranking can be
customized at the Document Policy configuration file level. Document rate algorithms and relevancy rules
for a particular database, once assigned and fine-tuned, rarely need to
be changed.
The information ranking facility that is built into
the Clusterpoint Server software also makes the Clusterpoint database scalable in
clusters: the Clusterpoint Server software automatically creates a Clusterpoint Index
ranked by your own database search rules, distributes the indexing workload among
cluster nodes, and provides generic index scalability for search and replication
within a cluster.
You do not need to address clustering logic in
your application software: this is another great advantage of
the Clusterpoint database architecture. From
an application software developer's point of view, any Clusterpoint database is
just a single logical database storage of XML documents, no matter how
many cluster servers the total database content is located on.
This
section describes multi-language support and character encoding concepts,
provides examples for different character encoding cases, and explains XML
formatting concepts.
This
section contains the following topics:
·
Multi-language Support and
Text Encoding
·
Formatting XML Special Characters
The Clusterpoint API structure is based on XML,
which means that all character encoding related issues adhere to XML
internationalization standards.
For more information on XML
internationalization standards, see http://www.w3.org/TR/REC-xml.
Clusterpoint Server provides complete
multi-language support by automatically performing all necessary character
encoding conversions.
You can import documents in different languages
and encodings into a single Clusterpoint storage, and you can perform
search queries in different languages and encodings against a single Clusterpoint
storage.
This section contains the following topics:
·
Overview
·
Storing and Searching in a Single
Encoding
·
Storing in Different Encodings and
Searching in a Multiple Bytes per Character Encoding
·
Storing in Different Encodings
and Searching in One Byte per Character Encoding
For document importing and searching Clusterpoint
Server supports any language and text encoding. When importing documents to the
Clusterpoint storage, internally, all documents are converted and stored in the
UTF-8 encoding, as illustrated in the following figure:

Figure 13: Importing documents with different encodings
In Figure 13, document encodings are represented as encoding
values each in a white box with a double dotted border.
Data exchange between an application and the Clusterpoint
Server software is performed in the XML format. In the XML format, data can be
in any encoding; this encoding is defined in the XML header of the document.
All Clusterpoint API functions have an encoding
parameter, which defines the encoding of textual data.
This encoding is used in the XML header, when importing documents to the Clusterpoint
storage as described in Creating Document Structure with Application. The textual data must comply with
the encoding defined in a function parameter, or else the system returns a
parsing error.
The encodings that can be used are limited only to
those that are installed on the computer on which the Clusterpoint Server runs.
To find out what encodings are installed on the Clusterpoint Server computer,
see the Clusterpoint Server User Guide. For example, on Red Hat Linux,
the US-ASCII, ISO8859-1..13, WINDOWS-1250..1258, UTF-7, UTF-8, UTF-16, and
UTF-32 encodings are usually installed.
Technically, only the encoding is important to
the Clusterpoint Server system, which means that you can store and search data
in the Clusterpoint Server system in any language as long as you supply a valid
encoding for that language.
There are the following two types of
encodings:
| Encoding type | Description |
| one byte per character | Contains 256 characters, which means that, within one such encoding, characters for several similar languages can be included, for example, WINDOWS-1250 and the ISO-8859 encodings. |
| multiple bytes per character | Contains all UCS (universal character set) characters, which include characters for almost all languages, for example, Greek, Cyrillic, Korean, and so on. |
You can store documents in different languages
with different encodings within a single Clusterpoint storage; documents are
converted to the UTF-8 encoding, which contains all characters from UCS and,
therefore, all characters are preserved correctly.
Search results are returned in the encoding
that is used for the search request.
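The conversion flow described above can be mimicked with Python's built-in codecs. This is a sketch of the idea only, not of the server internals; the function names to_storage and to_reply are hypothetical.

```python
def to_storage(raw: bytes, source_encoding: str) -> bytes:
    """Decode incoming bytes from their declared encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def to_reply(stored: bytes, request_encoding: str) -> bytes:
    """Re-encode stored UTF-8 bytes in the encoding used by the search request."""
    return stored.decode("utf-8").encode(request_encoding)


french = "garçon".encode("iso-8859-1")        # one byte per character
german = "Straße 5 €".encode("iso-8859-15")   # ISO-8859-15 includes the Euro sign

# Both documents end up in the same storage encoding (UTF-8).
stored = [to_storage(french, "iso-8859-1"), to_storage(german, "iso-8859-15")]

# A search request made in UTF-8 gets both documents back intact.
replies = [to_reply(item, "utf-8").decode("utf-8") for item in stored]
print(replies)  # ['garçon', 'Straße 5 €']
```

Because UTF-8 covers the whole universal character set, no character is lost when documents from different one byte per character encodings are stored together.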
The following three sections contain examples
with different cases of working with a single and several encodings, which
demonstrate the Clusterpoint Server multi-language support.
This section contains an example in which a single
encoding is used for storing and retrieving documents.
The following figure illustrates the example:

Figure 14: Storing and searching in single encoding
In Figure 14, document encodings are represented as encoding
values each in a white box with a double dotted border.
1) All documents are imported to the Clusterpoint
storage in the same encoding. In Figure 14, the encoding is ISO-8859-1 for French, which encodes
French character 'ç' and other characters that are not included in a
US-ASCII encoding.
2) Users submit search queries to the Clusterpoint
storage in the same encoding as the document source encoding.
3) Search results are returned to a
result set in the same encoding.
4) The search results are displayed
with correct characters to the user.
Note: The user computer must have appropriate
fonts installed for viewing that encoding. Older browser versions may not
support the UTF-8 encoding and display the special characters as question marks
?. In that case, the browser must be
updated.
This section contains an example in which
different encodings are used for storing documents and a multiple bytes per
character encoding is used for retrieval.
The following figure illustrates the example:

Figure 15: Storing in different encodings and searching in multiple bytes per character encoding
In Figure 15, document encodings are represented as encoding
values each in a white box with a double dotted border.
1) Documents are imported to the Clusterpoint
storage in different encodings. In Figure 15, the encodings are ISO-8859-1 for French and
ISO-8859-15 for German.
2) Users submit search queries to the Clusterpoint
storage in a multiple bytes per character encoding, in Figure 15, the encoding is UTF-8.
3) Search results are returned to a
result set in the encoding, which is used in the search request, in Figure 15, the encoding is UTF-8.
4) The search results are displayed
with correct characters to the user.
As in
this case the multiple bytes per character encoding is used, there are no
problems for displaying characters for both languages.
Note: The
user computer must have appropriate fonts installed for viewing that encoding.
Older browser versions may not support the UTF-8 encoding and display the
special characters as question marks ?. In that case, the browser must be updated.
This section contains an example in which
different encodings are used for storing documents and a one byte per character
encoding is used for retrieval.
The following figure illustrates the example:

Figure 16: Storing in different encodings and searching in one byte per character encoding
In Figure 16, document encodings are represented as encoding
values each in a white box with a double dotted border.
1) Documents are imported to the Clusterpoint
storage in different encodings. In Figure 16, the encodings are ISO-8859-1 for German and
ISO-8859-15 for German with Euro symbol '€'.
2) Users submit search queries to the Clusterpoint
storage in a one byte per character encoding; in Figure 16, this is the old German code page.
3) Search results are returned to a
result set in the encoding that is used in the search request; in Figure 16, this is the old German code page without
the Euro symbol.
4) The search results are displayed
with correct characters to the user.
Characters
that are not in the encoding are returned as XML entities. For example, in Figure 16, the Euro symbol '€' (if present in data) is returned as
the character reference &#8364;.
For
more information on XML entities, see http://www.w3.org/TR/REC-xml.
Note: The
user computer must have appropriate fonts installed for viewing that encoding.
Older browser versions may not support the UTF-8 encoding and display the
special characters as question marks ?. In that case, the browser must be updated.
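The replacement of characters that do not exist in the reply encoding can be reproduced with Python's standard xmlcharrefreplace error handler. It produces decimal character references; that the server emits exactly this form is an assumption of this sketch.

```python
text = "Preis: 10 €"

# ISO-8859-1 has no Euro sign, so a strict encode would raise UnicodeEncodeError.
# errors="xmlcharrefreplace" substitutes a decimal character reference instead.
reply_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(reply_bytes)  # b'Preis: 10 &#8364;'
```

An XML parser on the receiving side turns the reference back into the Euro character, so no information is lost even though the reply encoding cannot represent it directly.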
As
mentioned in earlier sections, data are sent from an application to the Clusterpoint
storage in the XML format. Therefore, the data must comply with XML
formatting rules. For example, the data cannot contain the XML special characters <, >, and &, which are used for XML markup; instead, &lt;, &gt;, and &amp; must be used, respectively.
Example:
If you have a title A&B, you must convert it to A&amp;B.
For more information on the XML formatting
rules, see http://www.w3.org/TR/REC-xml.
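In application code, this escaping is usually left to a library routine rather than done by hand. A minimal Python sketch using the standard library (the sample title is made up):

```python
from xml.sax.saxutils import escape, unescape

title = "A&B <draft>"

# escape() replaces &, <, and > with the corresponding XML entities.
escaped = escape(title)
print(escaped)  # A&amp;B &lt;draft&gt;

# unescape() reverses the substitution when reading a reply.
restored = unescape(escaped)
```

The ampersand is escaped first by the library, so an already produced &lt; is never turned into &amp;lt; within a single call.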
This
section gives a general description of the Clusterpoint API specification, which is
implemented in XML.
This
section contains the following topics:
·
Overview
·
Clusterpoint XML Message Envelope
This section contains the following topics:
XML request and reply messages are exchanged
between the application and the Clusterpoint storage via HTTP with the port 80
as the default.
As mentioned earlier, you can send
Clusterpoint Server commands to the Clusterpoint Server and receive replies as
XML messages, or you can submit HTTP GET parameters and receive
formatted replies.
Both options are described in the following
sections:
·
Exchanging XML Messages Directly
·
Submitting Parameters and Receiving
Formatted Replies
The following figure illustrates submitting Clusterpoint
Server commands and receiving replies via XML messages directly:

Figure 17: Exchanging XML messages directly
A request is sent as a POST method.
As the HTTP resource identification, the URL http://host/cgi-bin/cpse/cpse-gw.cgi must be used, where <host> is the Clusterpoint Server host
name.
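A request of this kind could be prepared as follows with the Python standard library. The Content-Type header and the helper name build_post are assumptions of this sketch; the guide only specifies the POST method and the gateway URL. The request is built but not sent.

```python
import urllib.request

GATEWAY_URL = "http://host/cgi-bin/cpse/cpse-gw.cgi"  # replace host with your server


def build_post(xml_message: str, encoding: str = "utf-8") -> urllib.request.Request:
    """Wrap an XML message in an HTTP POST request to the gateway."""
    body = xml_message.encode(encoding)
    # The Content-Type value is an assumption, not taken from the guide.
    headers = {"Content-Type": f"text/xml; charset={encoding}"}
    return urllib.request.Request(GATEWAY_URL, data=body, headers=headers, method="POST")


request = build_post(
    '<?xml version="1.0" encoding="utf-8"?>'
    '<cpse:request xmlns:cpse="www.clusterpoint.com"/>'
)
print(request.get_method(), request.full_url)

# To actually send it: reply_xml = urllib.request.urlopen(request).read()
```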
The following figure illustrates submitting Clusterpoint
Server commands as HTTP GET parameters and receiving formatted XML replies:

Figure 18: Submitting parameters and receiving formatted replies
A request is sent as a GET or POST method.
As the HTTP resource identification, the URL http://host/cgi-bin/cpse/cpse-gw.cgi must be used, where <host> is the Clusterpoint Server host
name. Command specific parameters must be included in query string or passed as
POST data.
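When parameters are passed in the query string, their values need URL encoding. A short Python sketch, where command_url is a hypothetical helper and the parameter names follow the examples used elsewhere in this guide:

```python
from urllib.parse import urlencode

GATEWAY_URL = "http://host/cgi-bin/cpse/cpse-gw.cgi"  # replace host with your server


def command_url(command: str, **params: str) -> str:
    """Build a gateway URL with the command and its parameters in the query string."""
    query = urlencode({"command": command, **params})
    return f"{GATEWAY_URL}?{query}"


url = command_url("insert", storage="test", id="1", title="Doc A&B")
print(url)
# http://host/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc+A%26B
```

Note that the ampersand inside the title is percent-encoded, so it is not mistaken for a parameter separator.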
As described previously, each XML message
contains a command name, content data that are specific for the command, and
other information, such as user name and request identifier, which is common
for all XML messages and included in the so called XML message envelope.
For more information on the XML message
envelope, see Clusterpoint XML Message Envelope.
The following figure illustrates the common
part for all XML messages and content part that is specific for each command:

Figure 19: XML message: common part and content part
Description of Clusterpoint API commands is
organized so that the common part is described in Clusterpoint
XML Message Envelope,
and only the content parts are described for each command in separate sections
named after the command.
XML elements are
presented as they appear in messages and each XML element is described within
its tags.
The command syntax consists of an XML request
and an XML reply, and as mentioned, XML requests can be submitted as HTTP GET
or POST parameters. To describe XML request, XML reply, and HTTP GET parameters
syntax, each section contains the following subsections:
| Subsection | Description |
| XML Request | Lists all XML request elements that are specific for the command as they appear in XML request messages. Each element is described within its tags. The description within the tags ends with an asterisk *, if the element is mandatory. |
| HTTP GET Parameters | Describes HTTP GET parameter syntax in the form of an example. The example looks as follows: http://host/cgi-bin/cpse/cpse-gw.cgi?param1=value&param2=value where: · host is the Clusterpoint Server IP address or host name, · param1, param2, and so on are HTTP GET parameters, · value is a parameter’s value. Note: The examples describe HTTP GET parameters; however, you can also submit POST parameters. |
| XML Reply | Lists all XML reply elements that are specific for the command as they appear in XML reply messages. Each element is described within its tags. |
Some elements in XML requests, and thus,
respective parameters, if submitting the XML request as HTTP GET parameters,
are mandatory, and some are not. The mandatory elements are marked with an
asterisk * in the XML request description.
However, there are some XML request elements
that are mandatory only if submitted as XML request, but are not mandatory if
submitted as HTTP GET parameters. Such parameters first must be defined in the Clusterpoint
Server Web server module configuration file, and then, do not have to be
submitted each time when sending a command. Parameters that can be defined in
the Clusterpoint Server Web server module configuration file are the following:
·
user
name
·
user
password
For more information on the Clusterpoint Server
Web server module configuration file, see the Clusterpoint Server User
Guide.
This section describes the common parts of the
XML request and reply for all Clusterpoint API commands.
<?xml version="1.0" encoding="REQUEST-ENCODING"?>
<cpse:request xmlns:cpse="www.clusterpoint.com">
<cpse:storage>storage name*</cpse:storage>
<cpse:command>command name*</cpse:command>
<cpse:timestamp>message creation date and time</cpse:timestamp>
<cpse:requestid>message number</cpse:requestid>
<cpse:application>creator of message</cpse:application>
<cpse:user>user name*</cpse:user>
<cpse:password>user password*</cpse:password>
<cpse:timeout> function timeout period </cpse:timeout>
<cpse:reply_charset>reply encoding</cpse:reply_charset>
<cpse:content>command specific data </cpse:content>
</cpse:request>
<?xml version="1.0" encoding="REPLY-ENCODING"?>
<cpse:reply xmlns:cpse="www.clusterpoint.com">
<cpse:storage>storage name</cpse:storage>
<cpse:timestamp>reply creation date and time</cpse:timestamp>
<cpse:content>command specific data</cpse:content>
<cpse:command>command name for which the reply is created</cpse:command>
<cpse:requestid>message number for which the reply is created</cpse:requestid>
<cpse:seconds>time period for the reply creation</cpse:seconds>
<cpse:replyid>unique message id created by the Clusterpoint Server</cpse:replyid>
</cpse:reply>
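As an illustration, the envelope can be assembled and read with Python's xml.etree.ElementTree. The helper names build_request and envelope_field are hypothetical, only the mandatory envelope elements are emitted, and the credentials shown are placeholders.

```python
import xml.etree.ElementTree as ET

CPSE_NS = "www.clusterpoint.com"
ET.register_namespace("cpse", CPSE_NS)  # serialize with the cpse: prefix


def build_request(storage: str, command: str, user: str, password: str) -> bytes:
    """Build an envelope that contains only the mandatory elements."""
    root = ET.Element(f"{{{CPSE_NS}}}request")
    for tag, value in (("storage", storage), ("command", command),
                       ("user", user), ("password", password)):
        ET.SubElement(root, f"{{{CPSE_NS}}}{tag}").text = value
    ET.SubElement(root, f"{{{CPSE_NS}}}content")  # command specific data goes here
    declaration = b'<?xml version="1.0" encoding="utf-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


def envelope_field(message: bytes, tag: str):
    """Return the text of one envelope element of a request or reply, or None."""
    return ET.fromstring(message).findtext(f"{{{CPSE_NS}}}{tag}")


message = build_request("test", "status", "demo_user", "demo_password")
print(message.decode("utf-8"))
print(envelope_field(message, "command"))  # status
```

The same envelope_field helper works for replies, for example to read cpse:seconds or cpse:replyid.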
This section describes the following data
manipulation commands:
·
Insert, Update, and Replace
·
Delete
·
Index
·
Clear
The insert command
adds a document to the Clusterpoint storage. If a document with such ID already
exists, the command returns an error.
If a document with such ID exists in the Clusterpoint
storage, the update command updates the document. If a document with such ID is not in the Clusterpoint
storage, the update command adds it to the Clusterpoint storage.
The replace command replaces contents of a
document in the Clusterpoint storage. If a document with such ID is not in the Clusterpoint
storage, the command returns an error.
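The difference between the three commands lies in how they treat an existing or missing document ID. The toy Python model below captures only that rule; it is not the server, the class and function names are hypothetical, and it does not model how update and replace differ in the content they write.

```python
class ToyStorage:
    """In-memory stand-in for a storage; it models only the document ID rules."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc_id, content):
        if doc_id in self.docs:
            raise KeyError(f"document {doc_id!r} already exists")
        self.docs[doc_id] = content

    def update(self, doc_id, content):
        # Works whether or not the ID exists: update existing, otherwise add.
        self.docs[doc_id] = content

    def replace(self, doc_id, content):
        if doc_id not in self.docs:
            raise KeyError(f"document {doc_id!r} not found")
        self.docs[doc_id] = content


def outcome(action, *args):
    """Run a command and report 'ok' or 'error' instead of raising."""
    try:
        action(*args)
        return "ok"
    except KeyError:
        return "error"


storage = ToyStorage()
results = [
    outcome(storage.insert, "1", "Doc1"),      # ok: new ID
    outcome(storage.insert, "1", "Doc1 v2"),   # error: ID already exists
    outcome(storage.update, "2", "Doc2"),      # ok: missing ID is added
    outcome(storage.replace, "3", "Doc3"),     # error: ID not in storage
    outcome(storage.replace, "1", "Doc1 v3"),  # ok: existing ID
]
print(results)  # ['ok', 'error', 'ok', 'error', 'ok']
```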
<cpse:content>
<document>document content</document>
</cpse:content>
Where the document content consists of document
structure elements. The default Clusterpoint Server document structure is as
follows:
<document>
<id> document id * </id>
<title> document title </title>
<rate> document rate </rate>
<domain> document domain </domain>
<info> meta data </info>
<text> textual information, which is used for indexing </text>
<hidden> textual information, which is used for indexing, but which is not shown</hidden>
</document>
For more information on the default Clusterpoint
Server document structure, see Creating Document Structure with
Application.
http://host/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc1
http://host/cgi-bin/cpse/cpse-gw.cgi?command=update&storage=test&id=1&title=Doc1
http://host/cgi-bin/cpse/cpse-gw.cgi?command=replace&storage=test&id=1&title=Doc1
If the command is executed successfully, the
XML reply does not contain any command specific data.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
Binary files conversion is an integrated feature
of the insert, update, and replace commands. The binary files conversion
functionality converts binary file contents into plain text. Thus, it is possible
to add several Microsoft Office file types and other binary files to the Clusterpoint
storage and perform full text search on them.
The following table lists extensions of binary
files that can be added to the Clusterpoint storage:
| Extension | Description |
| DOC | Microsoft Word document. |
| XLS | Microsoft Excel document. |
| PPT | Microsoft PowerPoint document. |
| RTF | Rich text format document. |
| PDF | Adobe portable document format document. |
| PS | PostScript document. |
To use the binary files conversion
functionality, in the XML request, use the file tag in place of the text tag, in the following format:
<file store="yes/no"> <!-- If store="yes", then the original document is stored in the Clusterpoint storage and returned when retrieved. The default value is "no". -->
<ext> extension of binary file </ext>
<data> data of binary file converted to the base64 encoding </data>
</file>
As described in the data tag, binary file contents first must be converted to the base64 encoding. This is because XML does not support storing binary data within a tag.
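A file element of this shape can be produced as follows in Python. The helper name file_element and the sample bytes are made up for illustration, and whether the extension must be written in upper or lower case is not stated in this guide.

```python
import base64
import xml.etree.ElementTree as ET


def file_element(data: bytes, ext: str, store: bool = False) -> ET.Element:
    """Build a <file> element with the binary content encoded as base64 text."""
    elem = ET.Element("file", {"store": "yes" if store else "no"})
    ET.SubElement(elem, "ext").text = ext
    ET.SubElement(elem, "data").text = base64.b64encode(data).decode("ascii")
    return elem


payload = b"%PDF-1.4 example bytes"  # stand-in for real file contents
elem = file_element(payload, "PDF", store=True)
print(ET.tostring(elem, encoding="unicode"))
```

In practice the bytes would come from reading the file in binary mode, for example with open(path, "rb").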
The delete command deletes a document from the
Clusterpoint storage. If a document with such ID is not in the Clusterpoint
storage, the command returns an error.
<cpse:content>
<document>
<id>document id *</id>
</document>
</cpse:content>
http://host/cgi-bin/cpse/cpse-gw.cgi?command=delete&storage=test&id=1
If the command is executed successfully, the
XML reply does not contain any command specific data.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
After inserting, updating, replacing, or
deleting documents in the Clusterpoint storage, the Clusterpoint Server must
permanently save the changes made to the Clusterpoint Index. The Clusterpoint
Server can decide on its own when to start saving the changes to the Clusterpoint
Index. However, to optimize performance for large data amounts, it
is recommended to inform the system when a portion of documents has been loaded and
no more documents are to be loaded in the near future, so that
the Clusterpoint Server can allocate all resources to the process of indexing.
The index command tells the Clusterpoint
Server to start the process of indexing.
The <cpse:content> element does not contain any
command specific data.
http://host/cgi-bin/cpse/cpse-gw.cgi?command=index
If the command is executed successfully, the
XML reply does not contain any command specific data.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
The clear command deletes all documents from
the Clusterpoint storage. This command should be used only when a complete
re-indexing of the Clusterpoint storage is necessary.
The <cpse:content> element does not contain any
command specific data.
http://host/cgi-bin/cpse/cpse-gw.cgi?command=clear
If the command is executed successfully, the
XML reply does not contain any command specific data.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
This section describes the Status command.
The status command returns status information
of the Clusterpoint Server instance. The status information includes:
·
number
of documents in the Clusterpoint storage
·
number
of words in the vocabulary
·
total
number of words in the Clusterpoint storage
·
number
of executed commands since the last startup of the instance
·
number
of errors that have occurred since the last startup of the instance
The <cpse:content> element does not contain any command
specific data.
http://host/cgi-bin/cpse/cpse-gw.cgi?command=status
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<status>
<mgr>
<started> date and time, when the Clusterpoint Server was started </started>
<age>time period the Clusterpoint Server is working since it was started</age>
<total_time_elapsed>total time spent by the Clusterpoint Server executing commands</total_time_elapsed>
<transactions><!-- This element contains information about executed commands. -->
<total> total number of commands executed</total>
<successful>number of commands that were successfully executed</successful>
<failed> number of commands that were executed unsuccessfully </failed>
<requests command="command name">number of times the command was executed</requests> <!-- This element is repeated for every command that was executed. -->
</transactions>
<last_modified> date and time, when modifications in Clusterpoint storage occurred last time </last_modified>
<queue> number of commands executed simultaneously </queue>
<version> Clusterpoint Server version number</version>
</mgr>
<idx>
<!-- This element contains information about the Clusterpoint Index. -->
<journal>
<usage> indexing memory cache usage in percent</usage>
</journal>
<pool_state> index state: normal, expanding, or collapsing</pool_state>
</idx>
<voc>
<!-- This element contains information about the vocabulary. -->
<unique_words>unique words in the Clusterpoint storage</unique_words>
<total_words>total number of all words</total_words>
</voc>
<dat>
<documents>total number of documents</documents>
<domains> number of distinct domains of documents</domains>
</dat>
</status>
</cpse:content>
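The following Python sketch shows one way a client might read a few counters from the status reply. It is an illustration only: the sample fragment and all numbers in it are made up, the function name summarize_status is hypothetical, and the sketch parses only the <status> element so that it does not depend on the namespace of the surrounding <cpse:content> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical <status> fragment; the numbers are invented for illustration.
SAMPLE_STATUS = """
<status>
  <mgr>
    <transactions>
      <total>200</total>
      <successful>190</successful>
      <failed>10</failed>
      <requests command="search">150</requests>
      <requests command="insert">50</requests>
    </transactions>
  </mgr>
  <idx>
    <journal><usage>35</usage></journal>
    <pool_state>normal</pool_state>
  </idx>
  <dat><documents>1200</documents></dat>
</status>
"""


def summarize_status(status_xml):
    """Extract a few counters from a <status> element given as a string."""
    status = ET.fromstring(status_xml)
    total = int(status.findtext("mgr/transactions/total"))
    failed = int(status.findtext("mgr/transactions/failed"))
    # One <requests> element is present for every command that was executed.
    per_command = {
        req.get("command"): int(req.text)
        for req in status.findall("mgr/transactions/requests")
    }
    return {
        "documents": int(status.findtext("dat/documents")),
        "pool_state": status.findtext("idx/pool_state").strip(),
        "failure_ratio": failed / total if total else 0.0,
        "per_command": per_command,
    }


summary = summarize_status(SAMPLE_STATUS)
```

A monitoring script could, for example, poll the status command periodically and raise an alert when pool_state is not normal or when failure_ratio grows.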
When importing data to the Clusterpoint storage:
· If the memory reserved for the memory cache is enough for the amount of data being imported, the index state is normal.
· If the memory reserved for the memory cache is not enough for the amount of data being imported, the index state is one of the following:
| Title | Description |
| expanding | The data being imported are written to another cache, which is written to the disk. |
| collapsing | When the import is complete, the Clusterpoint Server commits the data written to the disk to the Clusterpoint storage. |
Note: While the index state is expanding or collapsing, the data written to the disk are not available for FTS. The data become available for FTS only after they are added to the Clusterpoint Index.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
This section describes the following data
retrieval commands:
· Search
· Select
· Similar
The lookup command searches for a document in the Clusterpoint storage and reports whether a document with the given ID exists in the Clusterpoint storage.
The retrieve command returns a document from the Clusterpoint storage. If no document with the given ID is in the Clusterpoint storage, the command returns an error.
<cpse:content>
<document>
<id>document id *</id>
</document>
</cpse:content>
http://host/cgi-bin/cpse/cpse-gw.cgi?command=lookup&storage=test&id=1
http://host/cgi-bin/cpse/cpse-gw.cgi?command=retrieve&storage=test&id=1
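When such URLs are built in application code, parameter values should be percent-encoded so that characters such as spaces and & do not break the request. The Python sketch below assumes the gateway URL used in the examples of this guide; the helper name command_url and the second document ID are hypothetical.

```python
from urllib.parse import urlencode

# Gateway URL as used in the examples of this guide; "host" is a placeholder.
GATEWAY = "http://host/cgi-bin/cpse/cpse-gw.cgi"


def command_url(command, **params):
    """Return a gateway URL with all parameter values percent-encoded."""
    query = {"command": command}
    query.update(params)
    return GATEWAY + "?" + urlencode(query)


lookup_url = command_url("lookup", storage="test", id="1")
# An ID that contains a space and an ampersand is encoded safely.
retrieve_url = command_url("retrieve", storage="test", id="doc 1&2")
```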
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<found>indicator 1 or 0 if a document is found or not, respectively</found>
<results>
<document>
meta data for the lookup command
textual information for the retrieve command
</document>
</results>
</cpse:content>
For more information on policies, see Importing XML Data With Custom Structure.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
The search command performs FTS in the Clusterpoint
storage.
<cpse:content>
<query> search query *</query>
<docs> number of documents in the result set </docs>
<offset> offset from the beginning of the result set </offset>
<case_sensitive> Boolean type parameter: YES to enable case sensitivity of the first letter of words when performing the search, NO to disable it </case_sensitive>
<relevance> Boolean type parameter: YES to order results by relevance, NO not to order results by relevance </relevance>
<snippet> Boolean type parameter: yes (default) to display snippets, no not to display them </snippet>
<highlight> Boolean type parameter: yes (default) to highlight the text that matches the search query, no not to highlight it </highlight>
<group_size> Maximum number of documents per result group. Results from one domain are grouped together within one result page. If the parameter is not set, the default value is 0, which implies that no grouping by domains is performed and no limit is set. </group_size>
<rate_from> searching documents within a rate range: the FROM value </rate_from>
<rate_to> searching documents within a rate range: the TO value </rate_to>
<wildcards> <!-- This element contains parameters for configuring wildcard pattern support. -->
<allow> Whether wildcard pattern search is enabled. Values: “yes” or “no”. </allow>
<cover_factor> When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for to ensure higher performance. This element defines the limit as a percentage of the total number of occurrences, in the Clusterpoint storage, of all words that match the wildcard pattern. </cover_factor>
<min_expand> The minimum size, in absolute numbers, of the set of words from the Clusterpoint storage vocabulary that match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if only 2 words fall within the cover_factor, but min_expand is 4, then 4 words are used in the search. </min_expand>
<max_expand> The maximum size, in absolute numbers, of the set of words from the Clusterpoint storage vocabulary that match the wildcard pattern. This parameter overrides the cover_factor parameter. For example, if 20 words fall within the cover_factor, but max_expand is 16, then only 16 words are used in the search. </max_expand>
</wildcards>
</cpse:content>
If values for the wildcards tag are not defined, corresponding parameters
set in the Clusterpoint storage configuration file are used.
For more information on configuring the Clusterpoint storage, see the Clusterpoint Server User Guide.
This section contains the following topics:
· Case Sensitivity for Proper Names
· Web Friendly Result Navigation
Clusterpoint Server provides several mechanisms
for specifying your search query. Each mechanism has a definite syntax, which
is described in the following subsections. For a better understanding, each
subsection also contains an example of the mechanism described and an explanation
about what the example search query returns.
This section contains the following topics:
· AND
· OR
· NOT
· Stemming
To search for documents that contain a single
search term, the search term must be entered as is.
Example:
George returns documents that contain the
word “George”.
To search for documents that contain all of several search terms, which are not necessarily next to each other, the search terms must be separated by the space character.
Example:
George Brown returns documents that contain the
word “George” and the word “Brown”.
To search for documents that contain an exact phrase, the search phrase must be enclosed in quotation marks.
Example:
“George Brown” returns documents that contain the
exact phrase “George Brown”.
To search for documents that contain any of the
search terms, the search terms must be enclosed in { } and separated with the space
character.
Example:
{George Brown} returns documents that contain either
the word “George” or the word “Brown”.
To search for documents that do not contain the
search term, the search term must be preceded with ~.
Example:
~George returns documents that do not
contain the word “George”.
AND, OR, and NOT logical connectives can be combined in more complex search expressions using the brackets ( ), which allows you to build any Boolean expression.
Example:
{(George Brown) (Mary Green)} returns documents that contain either the word “George” and the word “Brown”, or the word “Mary” and the word “Green”.
{(A B ~C) “D E”} is parsed into the following expression tree:

Figure 20: Search query expression tree
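The query syntax described above can be assembled programmatically. The following Python sketch is an illustration only: the helper names all_of, any_of, none_of, and phrase are hypothetical, and the sketch assumes that plain double quotation marks are accepted for phrase search.

```python
def all_of(*terms):
    """AND: terms separated by spaces and grouped in brackets ( )."""
    return "(" + " ".join(terms) + ")"


def any_of(*terms):
    """OR: terms separated by spaces and enclosed in { }."""
    return "{" + " ".join(terms) + "}"


def none_of(term):
    """NOT: the term is preceded by ~."""
    return "~" + term


def phrase(text):
    """Exact phrase: the text is enclosed in quotation marks (assumed: plain double quotes)."""
    return '"' + text + '"'


# Rebuilds the expression whose tree is shown in Figure 20.
query = any_of(all_of("A", "B", none_of("C")), phrase("D E"))
```

Building queries from small helpers of this kind keeps the bracket nesting balanced, which is easy to get wrong when concatenating strings by hand.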
To search for documents that contain a class of words, use wildcard patterns that represent:
· exactly one unknown character using the question mark ?
· one or more unknown characters using the asterisk *
· a range of definite characters for one unknown character occurrence using the square brackets [ ]
Note: When
wildcard patterns are used to define a class of words to be searched, only a
limited number of statistically frequent words are searched for. This
limitation is introduced to preserve the high performance of the Clusterpoint
Server. However, the maximum number of the words being searched can be
increased or decreased, when configuring the Clusterpoint Server. For more
information on configuring the Clusterpoint Server, see the Clusterpoint
Server User Guide.
Example:
ma? returns documents that contain the word “map”, “may”, “mad”, “man”, and so on.
Geo* returns documents that contain the word “George”, “Geothermal”, “Geology”, and so on.
ma[py] returns documents that contain only
the word “map” or “may”.
c?[au]* returns documents that contain the word “counter”, “club”, “chapter”, “country”, “change”, “chat”, “council”, “class”, “cpu”, “challenge”, “church”, “couple”, “championship”, and so on.
By default, Clusterpoint Server indexes all common words and characters, such as “and”, “where”, and “how”, as well as certain single characters and single letters.
Unfortunately, they tend to slow down the search without improving the search results. Common words and characters of this kind are called ignored words.
The Clusterpoint Server can be configured to ignore the common words described above: it detects the words that appear most often in search queries to the Clusterpoint storage, using the customer-supplied ignored words list. It is possible to edit the limit of the ignored words list.
If a common word or a character is essential to
getting the search results you want, you can include it by preceding it with a
plus sign +.
Example:
George +and Mary returns documents that contain all
three words: “George”, “and”, and “Mary”.
It is possible to include in one search request a word and its inflected forms, for example, “go” and “going”.
This feature is especially useful for so-called synthetic languages, in which syntactic relations within sentences are expressed by changes in the form of a word that indicate distinctions of tense, person, gender, number, mood, voice, and case, for example, German, Russian, and Latin.
To search for documents that contain a word in its inflected forms, the word or phrase must be enclosed in dollar signs $.
Example:
$George$ returns documents that contain the word “George” or “George’s”.
To search for documents that contain the search
term in a specific tag, the search term must be enclosed in the appropriate
tags.
Note: Searching within markup can be performed only if the document policy rule property index with the value xml or all is used. For the default document structure, the index policy rule with the value xml is set by default. For more information on defining the storage policy, see Importing XML Data With Custom Structure.
Example:
<person>George Brown</person> returns documents that contain the word “George” in the <person> tag and the word “Brown” in the <person> tag.
{<person>George</person>
<address>”
It is possible to define the maximum number of words that may appear between certain search terms. These search terms are also defined in the search query. This feature is called proximity search.
To use the proximity search feature, the search term must be as follows:
@ N term1 term2 @,
where N is the maximum count of words between the search terms, and term1 and term2 are search terms. Any number of search terms can be included in the proximity search.
If N is 1, then the search is exactly the same as if the phrase search was used.
For more information on the phrase search, see Phrase
Search.
Example:
@ 4 phone fax @ returns documents that contain the
words “phone” and “fax” not further than 4 words from each other.
Because the Clusterpoint Server indexes not only text information but also numeric information, it is possible to perform numeric search. Numeric search allows searching for documents that contain numeric values within a numeric interval.
For example, suppose each document contains information about an object, including its geographic coordinates. In that case, numeric search can be performed to retrieve all objects within a definite range of geographic coordinates. Thus, Clusterpoint Server can be used in online maps, where people can find information on different objects in a definite area.
The numeric search can be performed only together with a text search.
Numeric values in documents are indexed and stored as floating point numbers, regardless of whether they are integers or floating point numbers in the original documents.
The fractional part is stored to six digits.
To use the numeric search functionality, the
search term must be as follows:
· To perform numeric search within a range of two numeric values, enter _textual search term_X .. Y, where X is the minimum value of the numeric search interval, and Y is the maximum.
· To perform numeric search for documents that contain a numeric value greater than the given value, enter _textual search term_>X.
· To perform numeric search for documents that contain a numeric value smaller than the given value, enter _textual search term_<X.
It does not matter if textual search term is
entered before or after numeric search term.
Example:
Document content:
<document>
<id>76541</id>
<title>George’s profile</title>
<text>
<name>George Brown</name>
<age>26</age>
</text>
</document>
Search query that matches the document:
<query>
<name>George</name> <age>20 .. 30</age>
</query>
<numeric_ordering>center</numeric_ordering>
Note: When performing numeric search for one document tag, such as the <age> tag in the previous example, only one numeric interval can be entered. If you enter more than one numeric interval for one tag, nothing is returned, since numeric intervals are joined with the AND logical operation.
For information on performing numeric search in more than one tag, see Numeric Search in More Than One Tag.
The <numeric_ordering> tag in the example denotes the
order in which search results must be returned.
Possible values for numeric ordering are the
following:
| Title | Description |
| none | No numeric ordering is applied. |
| center | Results that are closer to the mean value of the numeric search interval are returned first. This value is allowed only for numeric search within a range of two numeric values. |
| ascending | Numeric search results are returned in ascending order. |
| descending | Numeric search results are returned in descending order. |
It is possible to perform numeric search in more than one tag. This means that a numeric search range can be specified for each tag that contains numeric information.
Example:
Document content:
<document>
<id>76541</id>
<title>George’s profile</title>
<text>
<name>George Brown</name>
<age>26</age>
<children>4</children>
</text>
</document>
Search query that matches the document:
<query>
<name>George</name> <age>20 .. 30</age> <children>> 3</children>
</query>
<numeric_ordering>center</numeric_ordering>
Numeric search in more than one tag is
especially useful and necessary for geographic coordinate searching, where it
is necessary to search for an object by its longitude and latitude.
Note that the '>' symbol in the example above should be written as '&gt;' according to the XML encoding standard to avoid XML parsing errors.
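One way to avoid escaping mistakes is to let an XML library serialize the query. The Python sketch below uses the standard xml.etree.ElementTree module, which escapes special characters in element text automatically. The function name build_query is hypothetical, and the sketch assumes that the search terms are separated by single spaces, as in the example above.

```python
import xml.etree.ElementTree as ET


def build_query(terms):
    """Serialize a <query> element from a mapping of tag name to search term."""
    query = ET.Element("query")
    for tag, value in terms.items():
        ET.SubElement(query, tag).text = value
    # Separate the search terms with a space, as in the example above.
    for child in list(query)[:-1]:
        child.tail = " "
    return ET.tostring(query, encoding="unicode")


query_xml = build_query({
    "name": "George",
    "age": "20 .. 30",
    "children": "> 3",  # serialized as &gt; 3
})
```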
For numeric search in more than one tag, result ordering is combined into one ordering for all tags.
The following table describes how result ordering is combined:
| Ordering type | Description |
| ascending | Results are ordered ascending by the sum of all numeric values from the tags in which the numeric search is performed. |
| descending | Results are ordered descending by the sum of all numeric values from the tags in which the numeric search is performed. |
| center | Results are ordered by the shortest distance to the center of the intervals in multi-dimensional space, where each dimension represents a tag in which the numeric search is performed. The distance to the center of the intervals is calculated by the following formula: (x-xc)/xr*(x-xc)/xr + (y-yc)/yr*(y-yc)/yr + … + (z-zc)/zr*(z-zc)/zr, where x, y, z are the numeric values in each tag, xc, yc, zc are the centers of each numeric search interval, respectively, and xr, yr, zr are half of each numeric interval range, respectively. |
Numeric search in several tags, that is, in several dimensions, has an additional feature that allows returning numeric search results that match:
· a hypercube of all numeric intervals, which is the default, or
· only a hypersphere of all numeric intervals.
For example, if geographic coordinates of ATMs
in a city are indexed, it is possible to search for an ATM that is not farther
than 1 kilometer from a definite location. That is, you need to retrieve only
those ATMs that match the circle (a hypersphere with 2 dimensions in this case)
with a radius of 1 kilometer.
If, in the previous example, the default numeric search is performed, results that match a square with a side length of 2 kilometers are returned. This means that ATMs up to the square root of 2 kilometers away, which is approximately 1.41 kilometers, are also returned.
As said before, the default value for the multi-dimensional shape feature is a hypercube. The value for the multi-dimensional shape feature is defined in the <md_shape> tag, which is included in the Clusterpoint Server command syntax.
Possible values for the <md_shape> tag are the following:
| Value | Description |
| cube | Results that match a hypercube are returned. |
| sphere | Results that match a hypersphere are returned. |
Example:
Document content:
<document>
<id>76543</id>
<title>Gas station</title>
<text>
<name>Springtown</name>
<x>3.2</x>
<y>5.7</y>
</text>
</document>
Search query that matches the document and finds gas stations within the
distance of 1 kilometer from point (4.0, 6.0):
<query>
<name>gas</name> <x>3.0 .. 5.0</x> <y>5.0 .. 7.0</y>
</query>
<numeric_ordering>center</numeric_ordering>
<md_shape>sphere</md_shape>
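The difference between the two shapes can be checked with a short calculation. The Python sketch below applies the center distance formula given for numeric ordering to the gas station example. It is an illustration, not the server's implementation: the function names are hypothetical, and the sketch assumes that a document matches the hypersphere when its normalized squared distance to the center is at most 1.

```python
def center_distance(values, intervals):
    """Sum of ((value - center) / half_range) squared over all dimensions."""
    total = 0.0
    for value, (low, high) in zip(values, intervals):
        center = (low + high) / 2.0
        half_range = (high - low) / 2.0
        total += ((value - center) / half_range) ** 2
    return total


def in_hypercube(values, intervals):
    """True if every value lies inside its own interval."""
    return all(low <= v <= high for v, (low, high) in zip(values, intervals))


def in_hypersphere(values, intervals):
    # Assumption: the hypersphere contains the points whose normalized
    # squared distance to the center of the intervals is at most 1.
    return center_distance(values, intervals) <= 1.0


intervals = [(3.0, 5.0), (5.0, 7.0)]  # <x> and <y> ranges from the query
station = (3.2, 5.7)                  # coordinates of the example document
corner = (3.1, 5.1)                   # hypothetical point near a corner
```

With these values, the example document lies inside both shapes, while the hypothetical corner point lies inside the square but outside the circle.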
It is possible to perform case sensitive search
for proper names, which means that case sensitivity is applied for the first
letter of a search term.
The case sensitivity feature is switched on or
off by setting the <case_sensitive> parameter in the search command’s XML request.
For more information on the search command’s XML request, see XML
Request.
Example:
If the <case_sensitive> parameter is set to YES, and the search query contains “Bank”, then the search command returns documents in which the word “Bank” starts with a capital letter. Note that in this case, documents in which the word “BANK” is written in all capitals are also returned, since case sensitivity is applied only to the first letter of a search term.
It is possible to set the maximum number of documents in a search result that are returned from one group specified by a grouping tag. If this feature is used, documents with the same group tag value are grouped together within one result page of the search result.
You can define which XML tags to use for grouping in the Document policy with the ‘group=yes’ policy rule.
Grouping of results by the selected group is enabled by setting the <group_size> parameter in the search command’s XML request to a value larger than 0.
If the parameter is not set, the default value is 0, which implies that no grouping by group tag is performed and no limit is set.
<cpse:content>
...
<group> name of the tag for which groups were created </group>
<group_size> number of documents returned from one group (default 0 - no grouping performed) </group_size>
...
</cpse:content>
For more information on the search command’s XML request, see XML
Request.
It is possible to filter search results by document rate by setting the minimum and maximum of the rate range within which the rate of a document must fall for the document to appear in the search result.
Document rate is of the integer type. However, it is possible to convert any date and time into an integer using the UNIX timestamp, which represents a date and time as the number of seconds from 01/01/1970 until the given date and time. Thus, it is possible to set a date and time as the document rate and to search for documents within a certain time interval.
The filtering of results by rate is defined by setting the <rate_from> and <rate_to> parameters in the search command’s XML request.
For more information on the search command’s XML request, see XML
Request.
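As an illustration of using dates as document rates, the following Python sketch converts a date and time into a UNIX timestamp. The function name to_rate is hypothetical, and the sketch assumes that all dates are interpreted as UTC; this guide does not state which time zone should be used.

```python
from datetime import datetime, timezone


def to_rate(year, month, day, hour=0, minute=0, second=0):
    """Return the UNIX timestamp (seconds since 01/01/1970, UTC assumed) as an integer."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


# Values that could be placed in <rate_from> and <rate_to> to cover January 2011.
rate_from = to_rate(2011, 1, 1)
rate_to = to_rate(2011, 1, 31, 23, 59, 59)
```

The same conversion must be applied when the rate is assigned to a document at import time, so that the stored rates and the search range use the same scale.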
Clusterpoint Server is designed with Web and intranet applications in mind. In many cases, paging is used to display results on the Web. Paging implies that the search result records are divided into parts, where each part is displayed on its own page and contains a fixed number of records.
The Web friendly result navigation feature is
defined by setting the <docs> and <offset> parameters in the search command’s XML request.
For more information on the search command’s XML request, see XML
Request.
Example:
If the <docs> parameter is set to 10, and the <offset> parameter is set to 30, the search command returns results 30 through 39.
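In a Web application, the two parameters are usually derived from a page number. The following Python sketch shows the arithmetic; the function names are hypothetical, and pages and result positions are assumed to be counted from zero, as in the example above.

```python
def page_params(page, docs_per_page=10):
    """Return the <docs> and <offset> values for a zero-based page number."""
    if page < 0 or docs_per_page <= 0:
        raise ValueError("page must be >= 0 and docs_per_page must be > 0")
    return {"docs": docs_per_page, "offset": page * docs_per_page}


def result_range(params):
    """Return the first and last result positions covered by the parameters."""
    first = params["offset"]
    return first, first + params["docs"] - 1


fourth_page = page_params(3)
```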
Faceted search, sometimes also called XML drilldown, is a feature that returns search results together with additional information, called facets, which group documents into a hierarchical structure by category, theme, and so on, together with the number of search hits per category. Facets can then be used to narrow or expand the search within this ‘categorized’ structure without re-entering the search query. Using this feature, you can create extra navigation interfaces for dynamically generated, per-query catalogues, directories, themes, taxonomies, and much more.
Setting ‘index=facet’ policy rule
This Document policy rule must first be set for those tags, for example, ‘category’, for which facets should be generated.
<rule>
<xpath>//document/category</xpath>
<property>index=facet</property>
</rule>
<rule>
<xpath>//document/author</xpath>
<property>index=facet</property>
</rule>
The policy rule for this feature must be set before indexing the data in order to use the faceted search (XML drilldown) functionality. Facets are created at the index level only, to enable fast access without slow sorting and counting per query.
Document import
Once you have set the Document policy that defines which tags are treated as facets at search time, you can store and index documents in the storage.
<?xml version="1.0" encoding="utf-8"?>
<cpse:request xmlns:cpse="www.clusterpoint.com">
<cpse:storage>news</cpse:storage>
<cpse:command>insert</cpse:command>
<cpse:user>root</cpse:user>
<cpse:password>password</cpse:password>
<cpse:content>
<document>
<id>news_1</id> <!-- article id, primary key -->
<title>Lorem ipsum dolor sit amet</title>
<teaser>Nam neque metus, pulvinar a dapibus in, mattis et neque. Duis sollicitudin ultricies nisl, ut tristique justo gravida eget. </teaser>
<body>Suspendisse euismod porta suscipit. Donec non lorem ut sem varius sollicitudin at non risus. Curabitur nec tellus vitae nunc porttitor
ultrices quis vulputate magna. Nulla feugiat, nisl ut tristique dapibus, lacus sem pharetra leo, eu vestibulum arcu diam vel ante. Suspendisse dolor mauris,
pellentesque non mollis eu, accumsan quis nisi. Phasellus nec nulla eget eros aliquet fringilla ac eget enim.</body>
<published>01/01/2011 13:45:33</published>
<category>News
<subcategory>Business</subcategory>
</category>
<author>Alice B</author>
</document>
<document>
<id>news_2</id> <!-- article id, primary key -->
<title>Mauris at odio eget neque pellentesque</title>
<teaser>Etiam lobortis, diam non tincidunt scelerisque, tellus augue.</teaser>
<body>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi consectetur tristique lectus, dictum vestibulum magna mattis in.
Vestibulum tellus metus, interdum eget aliquam ac, feugiat dapibus felis. Nulla molestie fringilla elit, ac faucibus leo fringilla in.</body>
<published>28/01/2010 8:12:55</published>
<category>Sports
<subcategory>Tennis</subcategory>
</category>
<author>Jonathan C</author>
</document>
</cpse:content>
</cpse:request>
Note that subtags of category, such as ‘subcategory’, can also be included to return multi-level hierarchical faceted search results that can be used for navigation.
Facets generation per search query
Within the content tag of the ‘search’ request, supply a ‘facet’ tag, which identifies the XPath location relative to the tag for which the index facet policy rule is defined. Note that this request generates all facets that match the actual search query, along with the number of hits found for each facet.
Example request (multi-level faceted search drilldown sample):
<?xml version="1.0" encoding="utf-8"?>
<cpse:request xmlns:cpse="www.clusterpoint.com">
<cpse:storage>news</cpse:storage>
<cpse:command>search</cpse:command>
<cpse:user>root</cpse:user>
<cpse:password>password</cpse:password>
<cpse:content>
<query>od*</query>
<docs>10</docs>
<offset>0</offset>
<facet>category</facet> <!-- will retrieve all top categories -->
<facet>category=News/subcategory</facet> <!-- will retrieve all subcategories for "News" category -->
<facet>author</facet> <!-- will retrieve all authors -->
<list>
<id>yes</id>
<teaser>highlight</teaser>
<body>snippet</body>
</list>
</cpse:content>
</cpse:request>
Example response (single level drilldown):
<cpse:content>
…
<facet path="category">
<term hits="25">News</term>
<term hits="1">Comments</term>
<term hits="7">Questions</term>
<term hits="7">Replies</term>
</facet>
…
</cpse:content>
For multi-level drilldown, simply pass the correct deeper XPath location. Be sure to add “=<selected value>” to each parent category, or you will receive invalid hits.
Example response (multi-level drilldown):
<cpse:content>
…
<facet path="category=News/subcategory">
<term hits="10">Sports</term>
<term hits="1">Business</term>
<term hits="3">Politics</term>
</facet>
…
</cpse:content>
You can combine as many facets as you need in each search query; however, note that too many different facets can negatively affect performance for very large databases.
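The facet data in a response can be turned into a navigation structure with a few lines of code. The following Python sketch is an illustration only: the sample reuses the facet terms from the single level drilldown response shown earlier in this section, wrapped in a plain <content> element (an assumption made to keep the sketch independent of the reply namespace), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Facet terms taken from the single level drilldown response in this section.
SAMPLE_CONTENT = """
<content>
  <facet path="category">
    <term hits="25">News</term>
    <term hits="1">Comments</term>
    <term hits="7">Questions</term>
    <term hits="7">Replies</term>
  </facet>
</content>
"""


def facet_counts(content_xml):
    """Map each facet path to a dictionary of term -> number of hits."""
    root = ET.fromstring(content_xml)
    return {
        facet.get("path"): {
            term.text: int(term.get("hits")) for term in facet.findall("term")
        }
        for facet in root.findall("facet")
    }


def drilldown_path(parent_tag, selected_value, child_tag):
    """Build a deeper facet path such as category=News/subcategory."""
    return f"{parent_tag}={selected_value}/{child_tag}"


counts = facet_counts(SAMPLE_CONTENT)
next_facet = drilldown_path("category", "News", "subcategory")
```

When a user selects a term in the navigation interface, the application can send the next request with the <facet> tag set to the value of next_facet.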
http://host/cgi-bin/cpse/cpse-gw.cgi?command=search&storage=test&query=George
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<problems><!-- This element only appears if some search terms were ignored. -->
limpat/term (limited pattern coverage) or patign/term (frequent pattern - ignored) or worign/term (frequent word - ignored)
multiple ignored terms appear as problem/term problem/term problem/term
</problems>
<ignored>common words that are ignored when performing the search</ignored>
<realquery>real query that was used to perform the search, including the words derived from the wildcard usage and excluding the dropped ignored words</realquery>
<found> number of documents found </found>
<hits> approximate total amount of results that match the search query </hits>
<more> number that indicates how many more documents that match the search query are found, but are not returned to the result set yet: an exact number if in the form =N, and a minimum number if in the form >N </more>
<from> documents in the result set within a numerical range: the FROM value </from>
<to> documents in the result set within a numerical range: the TO value </to>
<results>
<document> meta data of the document found </document> <!-- This element is repeated for all documents found. -->
</results>
</cpse:content>
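The value of the <more> element needs a small amount of parsing before it can be shown to a user. The following Python sketch handles the two forms described above; the function name parse_more is hypothetical, and the sketch assumes that the value has already been extracted from the reply and unescaped by an XML parser.

```python
import re


def parse_more(value):
    """Return (count, is_exact) for a <more> value such as '=12' or '>100'."""
    match = re.fullmatch(r"\s*([=>])\s*(\d+)\s*", value)
    if match is None:
        raise ValueError(f"unexpected <more> value: {value!r}")
    sign, number = match.groups()
    # '=' marks an exact count, '>' marks a minimum count.
    return int(number), sign == "="


exact = parse_more("=12")
at_least = parse_more(" >100 ")
```

A result page could then display, for example, "12 more results" for an exact count and "more than 100 further results" for a minimum count.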
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
The select command searches for documents by their identifiers. It is possible to select one document by a precisely entered document identifier, or to use a wildcard pattern to select all documents whose identifiers match the wildcard pattern entered. For example, if only the asterisk * is entered, the identifiers of all documents in the Clusterpoint storage are returned.
The default number of document identifiers returned in the result set is 1024, but this number can be changed by entering a different number in the <docs> tag.
<cpse:content>
<document>
<id>document id *</id>
</document>
<docs> number of document identifiers in the result set </docs>
<offset> offset from the beginning of the result set </offset>
</cpse:content>
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<found> number of document identifiers matched </found>
<from> document identifiers in the result set within a numerical range: the FROM value </from>
<to> document identifiers in the result set within a numerical range: the TO value </to>
<results>
<id> meta data of the document found </id> <!-- This element is repeated for all document identifiers matched. -->
</results>
</cpse:content>
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
The similar command searches the Clusterpoint storage for documents that are similar to textual information, which is either given directly or contained in a document. The textual information for which similar documents are searched is also referred to as the input text.
The algorithm that searches for similar documents uses statistical information about the number of times the words contained in the input text, the so-called keywords, appear in documents, and finds documents similar to the input text fragment or to the document with a given ID.
You must take into account that the algorithm uses statistical information about words and does not know their meaning. Therefore, similar documents might not be semantically alike. However, experience with large text collections that contain medium-sized documents shows that the algorithm works well.
<cpse:content>
<id> document id to which similar documents must be searched for ** </id>
<text> textual information to which similar documents must be searched for ** </text>
<len> number of keywords in the input text * </len>
<quota> minimal number of keywords that must be found in documents that are returned in the search result * </quota>
<docs> number of documents to be returned in the result set </docs>
<offset> offset from the beginning of the result set </offset>
</cpse:content>
For large text collections in the Clusterpoint storage (> 1 million documents), best practice shows that setting the len element to 20 and the quota element to 4 gives the best results. However, you can experiment to find the best values for your specific text collection.
The two asterisks ** mean that exactly one of the two elements must be entered; in other words, the relationship between these two elements is XOR.
When developing Web applications, take into account the performance implications when the values of the len and quota parameters are not restricted. Usually it is not recommended to allow end users to change the values set by the application; otherwise some queries can take much longer than normal, even tens of seconds per query. For high-volume, high-usage Web sites, this search functionality should be used with care, and these parameters should always be fine-tuned for the specific data storage by the data storage owner.
Due to internal limitations in the API, one command cannot be used with XOR type parameters, so two commands are used. The ‘similar-id’ command is used when the search is performed against a document ID, and the ‘similar-text’ command is used when searching for similar text. The ‘similar’ command is provided for backward compatibility only and acts as ‘similar-id’.
http://host/cgi-bin/cpse/cpse-gw.cgi?command=similar-id&storage=test&id=Doc1&len=20&quota=4
http://host/cgi-bin/cpse/cpse-gw.cgi?command=similar-text&storage=test&text=George&len=20&quota=4
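The choice between the two commands can be made in application code. The following Python sketch selects ‘similar-id’ or ‘similar-text’ depending on which input is given and rejects requests that supply both or neither. The function name similar_url is hypothetical; the default values for len and quota are the ones suggested above for large text collections.

```python
from urllib.parse import urlencode

# Gateway URL as used in the examples of this guide; "host" is a placeholder.
GATEWAY = "http://host/cgi-bin/cpse/cpse-gw.cgi"


def similar_url(storage, doc_id=None, text=None, length=20, quota=4):
    """Build a similar-id or similar-text URL; exactly one input must be given."""
    if (doc_id is None) == (text is None):
        raise ValueError("give exactly one of doc_id or text")
    if doc_id is not None:
        params = {"command": "similar-id", "storage": storage, "id": doc_id}
    else:
        params = {"command": "similar-text", "storage": storage, "text": text}
    params["len"] = length
    params["quota"] = quota
    return GATEWAY + "?" + urlencode(params)


by_id = similar_url("test", doc_id="Doc1")
by_text = similar_url("test", text="George")
```

Keeping len and quota as server-side defaults, rather than exposing them to end users, follows the performance advice given above.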
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<found> number of documents found </found>
<hits> approximate total amount of results that match the search query </hits>
<more> number that indicates how many more documents that match the search query are found, but are not returned to the result set yet: an exact number if in the form =N, and a minimum number if in the form >N </more>
<from> documents in the result set within a numerical range: the FROM value </from>
<to> documents in the result set within a numerical range: the TO value </to>
<results>
<document> meta data of the document found </document> <!-- This element is repeated for all documents found. -->
</results>
</cpse:content>
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
If the alternatives search is performed, the system returns a set of alternative words from the Clusterpoint storage vocabulary that are similar in spelling or are a different inflected form of the search term. For example, if you enter “bote”, then “bite” and “byte” are offered for searching. Note that only actual words that have been indexed by Clusterpoint Server are returned. Words not present in any of the indexed documents are not available in the Clusterpoint Server vocabulary and therefore cannot be returned by this command.
This feature can be used for fuzzy searches and
for spelling error corrections.
Alternative words are returned from the vocabulary, which ensures that they are actual words that have been imported into the Clusterpoint storage.
When searching for alternative words, the alternatives command considers statistical information about the occurrence of the alternative word in the vocabulary, and the similarity of the alternative word to the search term. In other words, alternatives that occur more often in the Clusterpoint storage and that are more similar to the search term are returned.
<cpse:content>
<query> search query * </query>
<cr> Minimum ratio between the occurrence of the alternative and the occurrence of the search term that is required for the alternative to be included. If you increase this parameter, fewer results are returned to the result set, but performance improves. </cr>
<idif> Maximum number that indicates how much the alternative may differ from the search term; the greater the idif value, the greater the difference. If you increase this parameter, more results are returned to the result set, but performance is reduced. </idif>
<h> Minimum number that gives an overall estimate of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value. If you increase this parameter, fewer results are returned to the result set, but performance improves. </h>
</cpse:content>
If values for the cr, idif, or h tags are not defined, the corresponding parameters set in the Clusterpoint storage configuration file are used.
For more information on configuring Clusterpoint storage, see the Clusterpoint Server User Guide.
http://host/cgi-bin/cpse/cpse-gw.cgi?command=alternatives&storage=test&query=George
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<alternatives_list>
<alternatives>
<to> alternative search term </to>
<count> number of times the alternative search term occurs in the Clusterpoint storage</count>
<word count="number of times the alternative occurs in the Clusterpoint storage" cr="ratio between the occurrence of the alternative and the occurrence of the search term" idif="number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference" h="number that gives an overall estimate of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value"> alternative </word> <!-- This element is repeated for each alternative word. -->
</alternatives> <!-- This element is repeated for each search term. -->
</alternatives_list>
</cpse:content>
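An application typically wants the alternatives ordered by quality. The following Python sketch parses a reply fragment with the structure shown above and sorts the words of each group by their h value. The sample fragment and its numbers are invented for illustration, the function name best_alternatives is an assumption, and the to element is simply used as the key of each alternatives group.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply fragment following the structure described above;
# all numbers are made up for illustration.
SAMPLE = """
<alternatives_list>
  <alternatives>
    <to>bote</to>
    <count>0</count>
    <word count="120" cr="40.0" idif="1" h="8.5">bite</word>
    <word count="300" cr="100.0" idif="1" h="9.7">byte</word>
  </alternatives>
</alternatives_list>
"""


def best_alternatives(xml_text):
    """Map each <to> value to its alternative words, highest h value first."""
    result = {}
    for group in ET.fromstring(xml_text.strip()).iter("alternatives"):
        key = group.findtext("to", default="").strip()
        words = [(w.text.strip(), float(w.get("h"))) for w in group.findall("word")]
        words.sort(key=lambda pair: pair[1], reverse=True)
        result[key] = [word for word, _ in words]
    return result


print(best_alternatives(SAMPLE))
```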
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
The list-last command searches for the documents in the Clusterpoint storage that have most recently been inserted, updated, or replaced, using the insert, update, or replace commands, respectively.
<cpse:content>
<docs> number of documents in the result set </docs>
<offset> offset from the beginning of the result set </offset>
</cpse:content>
http://host/cgi-bin/cpse/cpse-gw.cgi?command=list-last&storage=test&docs=10&offset=100
If the command is executed successfully, the
XML reply contains the following command specific data.
<cpse:content>
<found>number of documents returned to the result set</found>
<results>
<document>
meta data for the list-last command</document><!-- This element is repeated for all documents found.
-->
</results>
</cpse:content>
For more information on how to define document
storage policy, see Importing XML Data With Custom
Structure.
If the command is not executed successfully, an
error is returned. For more information on errors, see Error
Handling.
Clusterpoint Server allows you to implement additional filtering functionality in your application for a specific storage, using an API command set for adding, removing and activating context triggers.
Please note that Clusterpoint Server does not activate any filtering context trigger event on its own. You should always program this functionality into your own application using the API command set provided in this section.
Remember that you are responsible for checking for context matching events and for deciding when to activate scripts that examine context trigger matches for stored documents.
A context trigger is defined as a search query that can be performed against the storage internally by the server, together with an event script (usually a shell script) that is executed on a match.
A special API command, EXAMINE, must be used to check an actual document against the list of defined filters.
Because the context trigger mechanism is controlled by the application, the application gains more flexibility in alert handling. For example, it can avoid triggering events automatically after data reindexing, which could otherwise generate a massive number of alert events, such as outgoing email messages.
Alerting API commands are sent to the server using standard Clusterpoint XML messaging.
This command adds a context trigger with the supplied ID to the filtering list of all context triggers activated for that document storage.
For one document storage there can be only one filtering list of context triggers.
There can be hundreds or thousands of context triggers in the filtering list.
Each added context trigger matches documents against the query supplied in the filter tag.
Your application should keep track of the trigger IDs as well, as the ID is the only reference to an individual context trigger in the filtering list of the storage. These IDs are returned when trigger events are executed with the API command EXAMINE for a specific document (see below).
< CPSSPC:command>add_trigger</ CPSSPC:command>
< CPSSPC:content>
<id>Trigger id</id>
<filter>Trigger filter query</filter>
<recipient>Recipient of notification</recipient>
</ CPSSPC:content>
Please note that the storage document ID is not the same as the 'id' used in the trigger definition.
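Filter queries and recipients may contain characters such as & or <, which must be escaped before they are placed into the XML message. The following Python sketch builds only the id, filter and recipient elements of the add_trigger content and keeps a local registry of trigger IDs. The function name, the registry, and the sample trigger values are assumptions; the surrounding message envelope is not shown.

```python
import xml.etree.ElementTree as ET


def add_trigger_content(trigger_id, query, recipient):
    """Return the <id>, <filter> and <recipient> elements as an XML string."""
    parts = []
    for tag, value in (("id", trigger_id), ("filter", query), ("recipient", recipient)):
        element = ET.Element(tag)
        element.text = value  # ElementTree escapes &, < and > on output
        parts.append(ET.tostring(element, encoding="unicode"))
    return "".join(parts)


# The application keeps its own registry of trigger IDs, because the ID is the
# only handle to an individual trigger in the filtering list.
registry = {}
registry["alert-1"] = add_trigger_content("alert-1", "George & Co", "sales@example.com")
print(registry["alert-1"])
```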
This command removes a specific context trigger from the filtering list of the storage.
< CPSSPC:command>remove_trigger</ CPSSPC:command>
< CPSSPC:content>
<id>Trigger id</id>
</ CPSSPC:content>
This command clears all context triggers for the specific storage, so that the filtering list becomes empty.
< CPSSPC:command>clear_triggers</ CPSSPC:command>
This command examines an existing document with a specific ID against all context triggers found in the filtering list. If the notify parameter is set to yes, a shell script is executed for each context trigger that matches the document.
The reply to this command also returns the list of IDs of the context triggers that matched the document.
This list of context trigger IDs can be processed by the user application to develop advanced content monitoring and alerting business applications.
< CPSSPC:command>examine</ CPSSPC:command>
< CPSSPC:content>
<document>
<id>document id to examine</id>
</document>
<notify>yes/no – to send message or not</notify>
</ CPSSPC:content>
The storage configuration must specify the shell script that is executed when a trigger matches a document, if this functionality is requested by the EXAMINE command.
This can be done using the Storage Configuration option of the Clusterpoint Manager tool, for example:
<config>
<alerts>
<action>Shell script to execute</action>
</alerts>
</config>
The shell script can perform any alert activity, such as sending an email message, writing to a log file, or updating a customer notification database.
If a command sent to the Clusterpoint Server is
not executed successfully, an error is returned in the following XML reply
message:
<?xml version="1.0" encoding="REPLY-ENCODING"?>
<cpse:reply xmlns:cpse="www.clusterpoint.com">
<cpse:storage> Storage name </cpse:storage>
<cpse:timestamp> reply creation date and time </cpse:timestamp>
<cpse:command> command name for which the reply is created </cpse:command>
<cpse:requestid> message request id for which the reply is created </cpse:requestid>
<cpse:content> command specific data </cpse:content>
<cpse:seconds> time spent for the reply creation </cpse:seconds>
<cpse:error>
<code> error code (4 digits) </code>
<text> error textual message </text>
<level> error severity </level>
<source> subsystem in which the error occurred </source>
<document_id> document_id that the error refers to - for some errors </document_id>
</cpse:error>
</cpse:reply>
The error severity can be one of the following:
| Type | Description |
| DEBUG | Debug information; can be switched on or off with the Storage configuration directive '/config/debug'. |
| NOTIFICATION | Information that may be useful. No action is necessary. |
| INFORMATION | Information that is useful. This type of error should be noted, though no action is necessary. |
| WARNING | Returned when the command has been executed successfully, but there are indications of a problem. |
| REJECTED | Returned when the input data are incorrect; the command has not been executed. |
| FAILED | Returned when a temporary system problem occurred; the command has not been executed. |
| ERROR | Returned when an internal system error occurred; the command has not been executed. |
| FATAL | Returned when Clusterpoint Server is not functioning. |
The error severity indicates how the application should react:
· If the error severity is FATAL or ERROR, inform the system administrator and stop working with the Clusterpoint Server.
· If the error severity is FAILED, REJECTED or WARNING, the errors are logged and can be analyzed, while work can continue.
When checking for errors in an automated script, always use the error code or error level as the reference point, not the error message text: the text can vary, while the error code always stays the same.
The full list of possible error codes is available in ERROR MESSAGES.
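The severity rules above can be applied mechanically to a reply. The following Python sketch reads the code and level elements and never inspects the message text. It is an illustration under stated assumptions: the cpse prefix is assumed to be bound to the namespace URI www.clusterpoint.com, the error code 1234 in the sample is invented, and the function name classify_reply is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{www.clusterpoint.com}"  # assumed namespace URI bound to the cpse prefix

STOP_LEVELS = {"FATAL", "ERROR"}                # inform the administrator, stop working
LOG_LEVELS = {"FAILED", "REJECTED", "WARNING"}  # log and continue


def classify_reply(xml_text):
    """Return (action, code, level); action is 'ok', 'log' or 'stop'."""
    error = ET.fromstring(xml_text).find(NS + "error")
    if error is None:
        return ("ok", None, None)
    code = error.findtext("code", default="").strip()
    level = error.findtext("level", default="").strip().upper()
    if level in STOP_LEVELS:
        return ("stop", code, level)
    if level in LOG_LEVELS:
        return ("log", code, level)
    return ("ok", code, level)  # DEBUG, NOTIFICATION, INFORMATION


# Hypothetical reply; the error code is made up for illustration.
SAMPLE = (
    '<cpse:reply xmlns:cpse="www.clusterpoint.com">'
    "<cpse:error><code>1234</code><text>sample text</text>"
    "<level>REJECTED</level></cpse:error></cpse:reply>"
)
print(classify_reply(SAMPLE))
```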
When reporting errors to technical support, please provide full information: the entire request and reply, the Clusterpoint Server version, the operating system, and any other information that can help to resolve the problem.
Clusterpoint Server is a transaction-based system, which means that commands have a predefined timeout period. If a command is not executed within this period, the command returns an error.
The timeout period can be defined for an individual request or configured for the Clusterpoint Server.
For more information on configuring timeout periods for the Clusterpoint Server, see the Clusterpoint Storage Configuration File.
For more detailed error references, descriptions of the most commonly encountered errors, error code groupings by problem area, and the complete list of all error codes, see APPENDIX A.
This section describes the Clusterpoint Server clustering technology and provides general steps for working with it. Using the clustering technology, several Clusterpoint Servers can be joined into one cluster, so that a search can be performed in a text collection of unlimited size.
This section contains the following topics:
· Creating Clusterpoint Server Cluster
A Clusterpoint Server cluster consists of nodes. Each node is a computer on which the Clusterpoint Server is installed.
The Clusterpoint Server cluster technology has a transparent architecture, which means that each node is fully functional on its own. Access to all cluster nodes or to a single cluster node is simple and is described below.
Clustering is a generic feature of Clusterpoint software. You can create clusters of multiple hardware servers for storing and searching very large data sets of hundreds of gigabytes or terabytes. You benefit both from sharing workload and data among multiple hardware servers, and from resilience against failures if you keep multiple mirrored copies.
By default, the clustering API command set is transparent to the application developer.
To create a Clusterpoint Server cluster storage, proceed as follows:
1) Assume that you have a very large text collection that is too big to be stored and processed on a single computer (treated here as node no. 1). Estimate into how many parts of equal size the text collection must be divided so that each part can be stored and processed on a single computer.
2) Install Clusterpoint Server on as many computers in your network as estimated in the previous step. In this example there is one additional computer (node no. 2).
3) Using the Cluster Storages option of the Clusterpoint Manager management tool, create a clustered Clusterpoint storage on the selected hardware servers (nodes node1 and node2 in this example) that are visible as Clusterpoint Server hardware in your network.
4) Create an application that imports each part of the text collection to its own node, addressing the node through the Clusterpoint gateway URL:
http://node1.domain.com/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc1
http://node2.domain.com/cgi-bin/cpse/cpse-gw.cgi?command=insert&storage=test&id=1&title=Doc2
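Step 4 leaves the choice of node to the application. One possible policy, shown here as a Python sketch, is to derive the node from a checksum of the document ID so that the same document always goes to the same node. This policy, the function names, and the sample documents are assumptions for illustration; the gateway path and the insert parameters are the ones used in the URLs above.

```python
import zlib
from urllib.parse import urlencode

NODES = ["node1.domain.com", "node2.domain.com"]  # node names from the steps above


def node_for(doc_id, nodes=NODES):
    """Pick a node deterministically from the document ID (assumed policy)."""
    return nodes[zlib.crc32(doc_id.encode("utf-8")) % len(nodes)]


def insert_url(doc_id, title, storage="test", nodes=NODES):
    """Build the gateway URL that inserts one document on its node."""
    query = urlencode({"command": "insert", "storage": storage,
                       "id": doc_id, "title": title})
    return "http://%s/cgi-bin/cpse/cpse-gw.cgi?%s" % (node_for(doc_id, nodes), query)


for doc_id, title in (("1", "Doc1"), ("2", "Doc2"), ("3", "Doc3")):
    print(insert_url(doc_id, title))
```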
When you have created a Clusterpoint Server cluster storage, you can either:
· Connect to any node of the Clusterpoint Server cluster with your Clusterpoint API request as if it were a single-storage command. Transparently to your application, the node connects to all other hardware nodes in the cluster, so that a unified result of a search query over the whole text collection is created. The same is true for indexing API commands. If a mirrored cluster storage was created, all updates are indexed identically on all cluster hardware nodes. If a striped cluster storage configuration was created, the hardware node with the least workload is selected for indexing each individual document (automatic load balancing determines where to place the document in the clustered storage).
· Connect to only one Clusterpoint Server cluster node and thus perform a search within only a part of the whole text collection, by specifying <cpse:type>single</cpse:type> in the Clusterpoint XML request message envelope and connecting directly to the gateway module IP address of that specific Clusterpoint Server. The API command is then performed only on the data set of that node (or 1/Nth of the total database content). In this way you can selectively index or search part of the data, depending on your application logic and the data distribution needs of your business application.
This optional mechanism for addressing cluster nodes directly, according to your own application logic and without a single point of failure, works well in modern private cloud computing environments.
For example, you can set up a pool of Clusterpoint database mirror servers (all running the same mirrored database) to be queried at random by your application, implementing nearly linear load sharing. Or you can address specific cluster nodes to store and search data without initiating cluster-wide operations.
It is also a solution oriented towards high-performance networking: it avoids unnecessary data transfers through a strictly specified gateway software, which would create a single point of failure and a performance trade-off for your application.
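Random querying of a mirror pool can be sketched in a few lines of Python. The pool of host names, the function names, and the stand-in send function are assumptions for illustration; in a real application, send would issue the HTTP request to the gateway of the given host.

```python
import random

MIRRORS = ["node1.domain.com", "node2.domain.com", "node3.domain.com"]  # hypothetical pool


def mirror_order(mirrors=MIRRORS, rng=random):
    """Return the mirrors in random order; the first entry is tried first."""
    order = list(mirrors)
    rng.shuffle(order)
    return order


def query_any(send, mirrors=MIRRORS, rng=random):
    """Try the mirrors in random order until send(host) succeeds."""
    last_error = None
    for host in mirror_order(mirrors, rng):
        try:
            return send(host)
        except OSError as exc:  # connection problem: try the next mirror
            last_error = exc
    raise RuntimeError("no mirror reachable") from last_error


# Stand-in for the real HTTP call; node2 is simulated as being down.
def fake_send(host):
    if host == "node2.domain.com":
        raise OSError("connection refused")
    return "reply from " + host


print(query_any(fake_send))
```

Because every mirror holds the same data, a failed node only costs one extra attempt and no single host becomes a point of failure.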
This section contains sample applications or use cases written for the Clusterpoint Server system. Each use case is in a separate section and contains a short description and source code.
This section contains the following use cases:
· Use Case in C: Importing Text Files
· Use Case in Perl: Importing Text Files
· Use Case in PHP: Searching Clusterpoint storage and Returning Results in HTML
· Use Case in ASP: Searching Clusterpoint storage and Returning Results in HTML
· Use Case in Java: Searching Clusterpoint storage from applet
This application is implemented in the C programming language.
The application reads files from the file system and imports them to the Clusterpoint storage. The application receives file names as command line arguments.
It also detects whether a file is a text file or a binary file by counting the whitespace characters in it: a file in which less than a fixed fraction of the characters is whitespace (12% in this sample, see REQUIRED_WHITESPACE_FRACTION) is considered binary and is skipped; any other file is considered to be a text file.
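The whitespace heuristic itself is independent of the programming language. As a compact illustration, here is the same test as a Python sketch; the function name looks_like_text is an assumption, and the threshold is the value used by the sample programs.

```python
REQUIRED_WHITESPACE_FRACTION = 0.12  # threshold used by the sample programs

ASCII_WHITESPACE = b" \t\n\r\x0b\x0c"  # the characters matched by C isspace() in the "C" locale


def looks_like_text(data, threshold=REQUIRED_WHITESPACE_FRACTION):
    """Return True if at least `threshold` of the bytes are ASCII whitespace."""
    spaces = sum(1 for byte in data if byte in ASCII_WHITESPACE)
    return spaces >= len(data) * threshold


print(looks_like_text(b"the quick brown fox jumps over the lazy dog"))
print(looks_like_text(bytes(range(256))))
```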
/*
 * This application is implemented in the C programming language.
 *
 * The application reads files from the file system and imports them to the
 * Clusterpoint storage via HTTP POST interface using libcurl.
 *
 * The application receives file names as command line arguments.
 *
 * It also detects whether the file is a text file or a binary file by counting
 * whitespace in it: if a file contains relatively little whitespace, it is
 * considered to be a binary file, and if a file contains relatively more
 * whitespace, it is considered to be a text file.
 */
/* include standard headers */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy, strlen, strstr */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <ctype.h>
/* libcurl */
#include <curl/curl.h>
/* connection parameters */
char *url = "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi";
char *storage = "test";
char *user = "name";
char *passwd = "pass";
char *encoding = "US-ASCII";
char *post_fmt = "storage=%s&command=insert&user=%s&password=%s&id=%s&title=%s&rate=%d&text=%s&encoding=%s";
#define REQUIRED_WHITESPACE_FRACTION 0.12
typedef struct { int len, used; char *buf; } curl_reply;
size_t read_reply(void *buffer, size_t size, size_t nmemb, void *userp)
{
int new_len;
curl_reply *r = (curl_reply *) userp;
for (new_len = r->len; new_len < r->used + size * nmemb + 1; new_len *= 2);
if (new_len > r->len) r->buf = realloc(r->buf, new_len);
memcpy(r->buf + r->used, buffer, size * nmemb);
r->len = new_len;
r->used += size * nmemb;
r->buf[r->used] = '\0';
return size * nmemb;
}
int main(int argc, char *argv[])
{
CURL *curl_handle;
char *storage_esc, *user_esc, *passwd_esc, *title_esc, *text_esc, *encoding_esc;
curl_reply reply;
char err_buf[CURL_ERROR_SIZE]; /* character buffer for libcurl error messages */
int i;
if (argc == 1) {
printf("Usage: [-r url] [-s storage] [-u user] [-p password] [-e encoding] files\n");
return 0;
}
/*
read options */
for (i = 1; i < argc; i++) {
if (argv[i][0] == '-') {
if (i + 1 >= argc) break; /* no option value */
switch(argv[i][1]) {
case 'r':
url = argv[i+1];
break;
case 's':
storage = argv[i+1];
break;
case 'u':
user = argv[i+1];
break;
case 'p':
passwd = argv[i+1];
break;
case 'e':
encoding = argv[i+1];
break;
default:
printf("Unknown option: %s\n", argv[i]);
break;
}
i++;
}
}
/*
initialization */
curl_global_init(CURL_GLOBAL_ALL);
curl_handle = curl_easy_init();
curl_easy_setopt(curl_handle, CURLOPT_URL, url);
curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, read_reply);
curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, &reply);
curl_easy_setopt(curl_handle, CURLOPT_ERRORBUFFER, err_buf);
storage_esc = curl_escape(storage, 0);
user_esc = curl_escape(user, 0);
passwd_esc = curl_escape(passwd, 0);
encoding_esc = curl_escape(encoding, 0);
/*
names of files to be imported are passed as arguments
* process each of them */
for (i = 1; i < argc; i++) {
FILE *f;
struct stat st;
int k, spaces;
char *buf, *post_data;
/*
check if argument is option */
if (argv[i][0] == '-') {
if (i + 1 >= argc) break; /* no option value */
i++;
continue;
}
printf("Reading file: '%s'\n", argv[i]);
/*
open file */
f = fopen(argv[i], "r");
if (!f) {
fprintf(stderr, "Couldn't open file '%s'\n", argv[i]);
continue;
}
/*
retrieve file information */
if (fstat(fileno(f), &st) < 0) {
fprintf(stderr, "Filesystem error retrieving info on '%s'\n", argv[i]);
fclose(f);
continue;
}
if (!S_ISREG(st.st_mode)) {
fprintf(stderr, "File '%s' is not regular file\n", argv[i]);
fclose(f);
continue;
}
printf("\tSize: %ld bytes\n", (long) st.st_size);
/* read all of it into memory */
/* note: this sample program assumes that the whole file fits into memory,
 * so if you need to work with larger files, process them in chunks instead */
buf = (char *) malloc(st.st_size + 1);
if (!buf) {
fprintf(stderr, "Memory allocation failed\n");
fclose(f);
continue; /* skip this file instead of reading into a NULL buffer */
}
k = fread(buf, 1, st.st_size, f);
fclose(f);
if (k < st.st_size) {
fprintf(stderr, "Error reading file\n");
free(buf);
continue;
}
buf[k] = '\0';
/*
see if it is text file
* estimate that by counting white space:
* natural language text in contrary to binary
data
* must contain significant portion of
whitespace */
spaces = 0;
for (k = 0; k < st.st_size; k++) {
if (isspace((unsigned char) buf[k])) spaces++; /* cast avoids undefined behavior for bytes >= 0x80 */
}
if (spaces < st.st_size * REQUIRED_WHITESPACE_FRACTION) {
printf("\tBinary file: ignored\n");
free(buf);
continue;
}
/*
execute Clusterpoint Server insert command through HTTP POST interface */
title_esc = curl_escape(argv[i], 0);
text_esc = curl_escape(buf, k);
post_data = malloc(strlen(storage_esc) + strlen(user_esc) + strlen(passwd_esc) + strlen(post_fmt) + 2 * strlen(title_esc) + 20 + strlen(text_esc) + strlen(encoding_esc));
sprintf(post_data, post_fmt, storage_esc, user_esc, passwd_esc, title_esc, title_esc, 100, text_esc, encoding_esc);
curl_easy_setopt(curl_handle, CURLOPT_POSTFIELDS, post_data);
reply.buf = malloc(reply.len = 1);
reply.used = 0;
if (curl_easy_perform(curl_handle) != CURLE_OK) {
fprintf(stderr, "Error connecting to Clusterpoint Server: %s\n", err_buf);
} else if (strstr(reply.buf, "<cpse:error>")) { /* simplified error check */
*((char *) strstr(reply.buf, "</text>")) = '\0';
fprintf(stderr, "Error returned from Clusterpoint Server: %s\n", strstr(reply.buf, "<text>") + 6);
} else {
*((char *) strstr(reply.buf, "</docid>")) = '\0';
printf("Document inserted with id %s\n", strstr(reply.buf, "<docid>") + 7);
}
/*
cleanup */
free(buf);
free(reply.buf);
curl_free(title_esc); /* strings returned by curl_escape are released with curl_free */
curl_free(text_esc);
free(post_data);
}
/*
final cleanup */
curl_free(storage_esc); /* strings returned by curl_escape are released with curl_free */
curl_free(user_esc);
curl_free(passwd_esc);
curl_free(encoding_esc);
curl_easy_cleanup(curl_handle);
curl_global_cleanup();
return 0;
}
This application is implemented in the Perl
programming language.
The application reads files from the file system and imports them to the Clusterpoint storage. The application receives file names as command line arguments.
#
# This application is implemented in the Perl programming language.
#
# The application reads files from the file system and imports them to the
# Clusterpoint storage through HTTP POST interface using LWP::UserAgent.
#
# The application receives file names as command line arguments.
#
# It also detects whether the file is a text file or a binary file by counting
# whitespace in it: if a file contains relatively little whitespace, it is
# considered to be a binary file, and if a file contains relatively more
# whitespace, it is considered to be a text file.
#
use HTTP::Request::Common;
use LWP::UserAgent;
use File::stat;
use Fcntl ':mode'; # provides S_IFMT and S_IFREG used below
# connection parameters
$url = "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi";
$storage = "test";
$user = "name";
$passwd = "pass";
$encoding = "US-ASCII";
$REQUIRED_WHITESPACE_FRACTION = 0.12;
if (@ARGV == 0) {
print "Usage: [-r url] [-s storage] [-u user] [-p password] [-e encoding] files\n";
exit;
}
# read options
for ($i = 0; $i < @ARGV; $i++) {
if (substr($ARGV[$i], 0, 1) eq '-') {
if ($i + 1 >= @ARGV) { last; } # no option value
$opt = substr($ARGV[$i], 1, 1);
$val = $ARGV[$i + 1];
if ($opt eq 'r') {
$url = $val;
} elsif ($opt eq 's') {
$storage = $val;
} elsif ($opt eq 'u') {
$user = $val;
} elsif ($opt eq 'p') {
$passwd = $val;
} elsif ($opt eq 'e') {
$encoding = $val;
} else {
print "Unknown option: ", $ARGV[$i], "\n";
}
$i++;
}
}
$ua = LWP::UserAgent->new;
# names of files to be imported are passed as arguments
# process each of them
for ($i = 0; $i < @ARGV; $i++) {
    # check if argument is option
    if (substr($ARGV[$i], 0, 1) eq '-') {
        if ($i + 1 >= @ARGV) { last; } # no option value
        $i++;
        next;
    }
    print "Reading file: '", $fn = $ARGV[$i], "'\n";
    # open file
    if (open(f, $fn)) {
        # retrieve file information
        if ($st = stat(*f)) {
            if (($st->mode & S_IFMT) == S_IFREG) {
                print "\tSize: ", $st->size, " bytes\n";
                # read all of it into memory
                # note: this sample program assumes that the whole file fits into memory,
                # so if you need to work with larger files, process them in chunks instead
                if (sysread(*f, $buf, $st->size) == $st->size) {
                    # see if it is text file
                    # estimate that by counting whitespace in it:
                    # natural language text, in contrast to binary data,
                    # must contain a significant portion of whitespace
                    # (s///g returns the number of substitutions made)
                    $nspaces = $buf =~ s/(\s)/$1/g;
                    if ($nspaces >= $st->size * $REQUIRED_WHITESPACE_FRACTION) {
                        # execute Clusterpoint Server insert command through HTTP POST interface
                        $response = $ua->request(POST $url, [
                            storage => $storage,
                            command => 'insert',
                            user => $user,
                            password => $passwd,
                            id => $fn,
                            title => $fn,
                            rate => 100,
                            text => $buf,
                            encoding => $encoding
                        ]);
                        if ($response->is_success && $response->content) {
                            if ($response->content !~ /<cpse:error>/) { # simplified error check
                                $response->content =~ /<docid>([^<]*)<\/docid>/;
                                print "Document inserted: docid = $1\n";
                            } else {
                                $response->content =~ /<code>([^<]*)<\/code>/;
                                print STDERR "Error returned from Clusterpoint Server: $1 - ";
                                $response->content =~ /<text>([^<]*)<\/text>/;
                                print STDERR "$1\n";
                            }
                        } else {
                            print STDERR "Error connecting to Clusterpoint Server: ", $response->code, ' - ', $response->message, "\n";
                        }
                    } else {
                        print "\tBinary file: ignored\n";
                    }
                } else {
                    print STDERR "Error reading file\n";
                }
            } else {
                print STDERR "File '$fn' is not a regular file\n";
            }
        } else {
            print STDERR "Filesystem error retrieving info on '$fn'\n";
        }
        close(f);
    } else {
        print STDERR "Could not open file '$fn'\n";
    }
}
This application is implemented in the PHP
programming language.
The application searches the Clusterpoint
storage and returns the results in HTML.
<?php
//
// This application is implemented in the PHP programming language.
//
// The application searches the Clusterpoint storage using HTTP API and returns the
// results in HTML.
//
$CPSE_SERVER = "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi";
$CPSE_STORAGE = "test";
$CPSE_USER = "name";
$CPSE_PASSWD = "pass";
//search query
$query = isset($_GET["q"]) ? $_GET["q"] : "";
//current position in results
$curr_position = isset($_GET["p"]) ? intval($_GET["p"]) : 0;
//data encoding
$encoding = "UTF-8";
//send http header with correct encoding
send_headers($encoding);
//max results on page
$result_on_page = 10;
if ($curr_position < 0) {
    $curr_position = 0;
}
//max page from one site to show
$max_page_from_site = 2;
$xml_text = file_get_contents($CPSE_SERVER . "?storage=" . urlencode($CPSE_STORAGE) . "&command=search&user=" . urlencode($CPSE_USER) . "&password=" . urlencode($CPSE_PASSWD) . "&query=" . urlencode($query) . "&docs=$result_on_page&offset=$curr_position&from_site=$max_page_from_site&encoding=$encoding");
if ($xml_text == "") {
    die("Clusterpoint Server search error!");
}
//initialize xml to array object
$xml2a = new XMLToArray();
//parse xml
$root_node = $xml2a->parse($xml_text);
//pop root node from array
$cpse_reply = array_shift($root_node["_ELEMENTS"]);
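//array for storing data from search results
//like total time spent, hits, and so on
$spec_data = array();
//HTML accumulated for the result list and the page list
$output = "";
$page_list = "";
$blockquote = FALSE;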
// examining Clusterpoint Server reply elements
foreach ($cpse_reply["_ELEMENTS"] as $cpse_reply_el) {
if ($cpse_reply_el["_NAME"] == "seconds") {
$spec_data[$cpse_reply_el["_NAME"]] = $cpse_reply_el["_DATA"];
}
// examining Clusterpoint Server content elements
foreach ($cpse_reply_el["_ELEMENTS"] as $cpse_content) {
$spec_data[$cpse_content["_NAME"]] = $cpse_content["_DATA"];
$last_site = '';
foreach($cpse_content["_ELEMENTS"] as $results) {
$tit = "";
$others = "";
//parse each document tag from the result set
foreach($results["_ELEMENTS"] as $documents) {
switch ($documents["_NAME"]) {
case "title" :
$tit .= '<b>'.$documents["_DATA"].'</b>';
break;
case "id" :
$others .= '<br/><font size="-1"> ID: '.$documents["_DATA"].'</font>';
break;
case "site" :
if ($last_site == $documents["_DATA"])
$blockquote = TRUE;
else
$blockquote = FALSE;
$others .= '<br/><font size="-1"> Site: <i>'.$documents["_DATA"].'</i></font>';
$last_site = $documents["_DATA"];
break;
case "rate" :
break;
case "info" :
break;
case "text" :
if (!empty($documents["_DATA"]))
$others .= '<br/><font size="-1"> Snippet: <i>'.$documents["_DATA"].'</i></font>';
break;
}
}
//tab sites
if ($blockquote)
$output .= '<blockquote>'.$tit.$others.'</blockquote>';
else
$output .= '<br>'.$tit.$others.'<br>';
}
}
}
if ($spec_data["hits"] == 0) {
$output = "Your search <b>$query</b> did not match any documents!";
} else {
//page listing
$from = $curr_position + 1;
$to = $curr_position + strval($spec_data["found"]);
$list_begin_pos=0;
$list_end_pos=$curr_position+($result_on_page*10);
$page_list .= "<font size=\"-1\">";
$p = 1;
if($curr_position > ($result_on_page * 10)){
$list_begin_pos=$curr_position-($result_on_page*10);
$p=intval($list_begin_pos/$result_on_page)+1;
}
if ($curr_position > 0) {
$page_list .= " <a href=\"search.php?p=". ($curr_position - $result_on_page) ."&q=".urlencode(stripslashes($query))."\">&lt;&lt;Previous</a> ";
}
$more_tag = $spec_data["more"];
if ($more_tag[0] == '=') {
$more = substr($more_tag,1);
} else {
$more = substr($more_tag,1) + 1;
}
for ($i = $list_begin_pos; $i - ($curr_position + $more) < $result_on_page && $i < $list_end_pos; $i+= $result_on_page) {
if($i>=1000) break;
if ($i != $curr_position) {
$page_list .= "<a href=\"search.php?p=$i&q=".urlencode(stripslashes($query)).(!empty($dir)?"&dir=$dir":"")."\">$p</a> ";
} else {
$page_list .= "<b>$p </b>";
}
$p++;
}
if (($result_on_page+$curr_position)-($curr_position+$more) < 10 && $curr_position + $result_on_page < 1000) {
$page_list .= " <a href=\"search.php?p=".($curr_position + $result_on_page)."&q=".urlencode(stripslashes($query))."\">Next&gt;&gt;</a>";
}
$page_list .= "</font>";
//end of page listing
}
//echo output to client
echo '
<html>
<head>
<style><!--
body,td,p,a{font-family:arial,sans-serif;}
.servkat{color:003399; text-decoration:none}
.homepage{color:003399; text-decoration:none; font-size:10px;}
//-->
</style>
</head>
<body>
<table>
<tr bgcolor="#cccc66">
<td><font size="-1"> Searched for: <b>'.htmlspecialchars($query).'</b> Results: <b>'.$from.'</b> - <b>'.$to.'</b> from <b>'.$spec_data["hits"].'</b> Search lasted <b>'.$spec_data["seconds"].'</b> seconds </font> </td>
</tr>
</table>
'.$output.'<br>
'.$page_list.'
</body>
</html>';
//#########################################################
class XMLToArray
{
//----------------------------------------------------------------------
//
private variables
var $parser;
var $node_stack = array();
//----------------------------------------------------------------------
//
PUBLIC
//
If a string is passed in, parse it right away.
function XMLToArray($xmlstring="")
{
if ($xmlstring) return($this->parse($xmlstring));
return(true);
}
//----------------------------------------------------------------------
//
PUBLIC
//
Parse a text string containing valid XML into a multidimensional array
//
located at root node.
function parse($xmlstring="")
{
// set up a new XML parser to do all the work for us
$this->parser = xml_parser_create();
xml_set_object($this->parser, $this);
xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($this->parser, "startElement", "endElement");
xml_set_character_data_handler($this->parser, "characterData");
// Build a Root node and initialize the node_stack
$this->node_stack = array();
$this->startElement(null, "root", array());
// parse the data and free the parser...
xml_parse($this->parser, $xmlstring);
xml_parser_free($this->parser);
// recover the root node from the node stack
$rnode = array_pop($this->node_stack);
// return the root node
return($rnode);
}
//----------------------------------------------------------------------
// PROTECTED
// Start a new Element. This is done by pushing the new element onto the stack
// and resetting its properties.
function startElement($parser, $name, $attrs)
{
// create a new node
$node = array();
$node["_NAME"] = $name;
foreach ($attrs as $key => $value) {
$node[$key] = $value;
}
$node["_DATA"] = "";
$node["_ELEMENTS"] = array();
// add the new node to the end of the node stack
array_push($this->node_stack, $node);
}
//----------------------------------------------------------------------
// PROTECTED
// End an element. This is done by popping the last element from the
// stack and adding it to the previous element on the stack.
function endElement($parser, $name)
{
// pop this element off the node stack
$node = array_pop($this->node_stack);
$node["_DATA"] = trim($node["_DATA"]);
// and add it as an element of the last node in the stack...
$lastnode = count($this->node_stack);
array_push($this->node_stack[$lastnode-1]["_ELEMENTS"], $node);
}
//----------------------------------------------------------------------
// PROTECTED
// Collect the data onto the end of the current chars.
function characterData($parser, $data)
{
// add this data to the last node in the stack...
$lastnode = count($this->node_stack);
$this->node_stack[$lastnode-1]["_DATA"] .= $data;
}
//----------------------------------------------------------------------
}
//#########################################################
//## END OF CLASS
//#########################################################
//sends Content-type header to client browser
function send_headers($encoding)
{
Header("Content-type: text/html;charset=$encoding");
}
?>
This application is implemented in ASP, using the VBScript programming language.
The application searches the Clusterpoint storage and returns the results in HTML.
<%
'
' This application is implemented in the VBScript programming language.
'
' The application searches the Clusterpoint storage using HTTP API and returns the
' results in HTML.
'
%>
<html>
<head>
<title>Search</title>
<style>
#results div.header { margin-bottom: 35px; }
#results div.result { padding-left: 15px; }
#results p.title { margin-bottom: 3px; }
#results p.snip { margin: 0px; }
#results p.id { margin-top: 3px; font-size: 0.9em; color: gray; }
#results p.error { color: red; }
#results .pagelist { padding-top: 20px; }
#results .pagelist p { display: inline; }
#results .pagelist ul { margin: 0px; padding: 0px; display: inline; }
#results .pagelist li { display: inline; }
</style>
</head>
<body>
<div id="results">
<%
nDocs = 10 'results per page
nPages = 10 'pages listed
Offset = Int(Request.QueryString("page")) * nDocs 'pages are numbered from 0, displayed from 1
sQuery = Request.QueryString("query")
Set Http = Server.CreateObject("MSXML2.ServerXMLHTTP")
Http.Open "POST", "http://127.0.0.1/cgi-bin/cpse/cpse-gw.cgi", False
Http.Send "storage=test&command=search&docs=" & nDocs & "&offset=" & Offset & "&relevance=yes&query=" & Server.URLEncode(sQuery)
if Http.Status = 200 and not Http.ResponseXML is Nothing then
Set Dom = Http.ResponseXML
Dom.SetProperty "SelectionNamespaces", "xmlns:cpse='www.clusterpoint.com'"
Set Content = Dom.SelectSingleNode("cpse:reply/cpse:content")
if not Content is Nothing then
'command executed ok
n = Int(Content.SelectSingleNode("hits").Text)
%><div class="header"><p>Found <b><% if n > 0 then Response.Write n else Response.Write "no"
%></b> document<% if n <> 1 then Response.Write "s"
if not Content.SelectSingleNode("real_query") is Nothing then sRealQuery = Content.SelectSingleNode("real_query").Text else sRealQuery = ""
%> matching "<%= Server.HTMLEncode(sRealQuery)
%>" (<b><%= Dom.SelectSingleNode("cpse:reply/cpse:seconds").Text %></b> seconds)</p></div><%= vbCrLf %><%
if n > 0 then
'something has been found
for each Result in Content.SelectNodes("results/document")
%><div class="result"><%
sTitle = Result.SelectSingleNode("title").Text
%><p class="title"><a href="<%= Server.HTMLEncode(Result.SelectSingleNode("id").Text) %>"><%= Server.HTMLEncode(sTitle) %></a></p><%
%><p class="snip"><%= Replace(Result.SelectSingleNode("text").Text, "#", " ") %></p><%
%><p class="id"><%= Server.HTMLEncode(Result.SelectSingleNode("id").Text) %></p><%
%></div><%= vbCrLf %><%
next
'page listing
iFrom = Int(Content.SelectSingleNode("from").Text)
nMore = Int(Mid(Content.SelectSingleNode("more").Text, 2))
nSure = Int((nMore + iFrom + 2 * nDocs - 1) / nDocs)
if iFrom > 0 or nMore > 0 then
%><div class="pagelist"><%= vbCrLf %><p>Result pages:<%= vbCrLf %><%
iPage = Int(iFrom / nDocs)
i = Int((iPage - 1) / (nPages - 2)) * (nPages - 2)
if i < 0 then i = 0
%><ul><%= vbCrLf %><%
sLink = "<a href=""search.asp?query=" & Server.URLEncode(sQuery) & "&page="
if iPage > 0 then
%><li><%= sLink & (iPage - 1) %>">&lt;&lt;&lt; Previous</a></li><%= vbCrLf %><%
end if
j = i
do while j < i + nPages and j < nSure
%><li><%
if j = iPage then
%><b><%= j + 1 %></b><%
else
%><%= sLink & j %>"><%= j + 1 %></a><%
end if
%></li><%= vbCrLf %><%
j = j + 1
loop
if nMore > 0 then
%><li><%= sLink & (iPage + 1) %>">Next &gt;&gt;&gt;</a></li><%= vbCrLf %><%
end if
%></ul><%= vbCrLf %></p><%= vbCrLf %></div><%= vbCrLf %><%
end if
end if
else
'error
Set Content = Dom.SelectSingleNode("cpse:reply/cpse:error")
%><p class="error">Error <%= Content.SelectSingleNode("code").Text %>: <%= Server.HTMLEncode(Content.SelectSingleNode("text").Text) %></p><%= vbCrLf %><%
end if
else
%><p class="error">Search failed!</p><%= vbCrLf %><%
end if
%> </div>
</body>
</html>
This application is implemented as a Java applet.
The application searches the Clusterpoint storage and displays the results.
import java.awt.*;
import javax.swing.*;
import java.awt.event.*;
import java.util.Random;
public class ClusterpointServerJApi extends JApplet implements ActionListener {
private JPanel contentPane;
private JPanel Buttons = new JPanel();
private JPanel Results = new JPanel();
private JPanel Properties = new JPanel();
private JPanel MainPan = new JPanel();
private JLabel HostLabel = new JLabel("Host:");
private JLabel StorageLabel = new JLabel("Storage:");
private JLabel QueryLabel = new JLabel("Query:");
private JButton SearchButt = new JButton("Search");
private JButton ClearButt = new JButton("Clear");
private JTextField QueryField = new JTextField(10);
private JTextField HostField = new JTextField("http://",20);
private JTextField StorageField = new JTextField(10);
private JTextArea ResultArea = new JTextArea();
public void init() {
//ContentPane Layout
contentPane = (JPanel) this.getContentPane();
contentPane.setLayout(new BorderLayout());
//Main pane
MainPan.setLayout(new GridBagLayout());
//Properties pane
Properties.setLayout(new GridBagLayout());
Properties.setBorder(BorderFactory.createTitledBorder("Properties"));
//Buttons pane
Buttons.setLayout(new GridLayout(1,2,5,0));
//Results pane
GridLayout gridLayout1 = new GridLayout();
gridLayout1.setVgap(0);
gridLayout1.setHgap(0);
gridLayout1.setColumns(1);
gridLayout1.setRows(10);
Results.setLayout(new BorderLayout());
Results.setBorder(BorderFactory.createTitledBorder("Results"));
//Add Main pane to contentPane
contentPane.add(MainPan,BorderLayout.NORTH);
//Properties pan to Main pane
MainPan.add(Properties,new GridBagConstraints(0, 0, 1, 1, 1.0, 1.0
,GridBagConstraints.NORTH, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));
//Add controls to Properties Pane
Properties.add(HostLabel,new GridBagConstraints(0, 1, 1, 1, 1.0, 1.0
,GridBagConstraints.WEST, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 1, 0));
Properties.add(HostField,new GridBagConstraints(1, 1, 1, 1, 1.0, 1.0
,GridBagConstraints.WEST, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));
Properties.add(StorageLabel, new GridBagConstraints(2, 1, 1, 1, 1.0, 1.0
,GridBagConstraints.WEST, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 1, 0));
Properties.add(StorageField, new GridBagConstraints(3, 1, 1, 1, 1.0, 1.0
,GridBagConstraints.WEST, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));
Properties.add(QueryLabel, new GridBagConstraints(0, 2, 1, 1, 1.0, 1.0
,GridBagConstraints.WEST, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 0, 0));
Properties.add(QueryField, new GridBagConstraints(1, 2, 3, 1, 1.0, 1.0
,GridBagConstraints.WEST, GridBagConstraints.HORIZONTAL, new Insets(1, 1, 1, 1), 0, 0));
//Add Buttons to Main Pan
MainPan.add(Buttons,new GridBagConstraints(0, 1, 1, 1, 1.0, 1.0
,GridBagConstraints.NORTH, GridBagConstraints.NONE, new Insets(1, 1, 1, 1), 20, 0));
Buttons.add(SearchButt,null);
Buttons.add(ClearButt,null);
MainPan.add(Results,new GridBagConstraints(0, 2, 1, 1, 1.0, 1.0
,GridBagConstraints.NORTH, GridBagConstraints.BOTH, new Insets(1, 1, 0, 1), 0,0));
Results.add(ResultArea);
SearchButt.addActionListener(this);
ClearButt.addActionListener(this);
}
//Action listener search and clear buttons
public void actionPerformed(ActionEvent e) {
if (e.getSource() == ClearButt) {
QueryField.setText("");
} else if (e.getSource() == SearchButt) {
String cgiUrl = new String(HostField.getText());
ClusterpointServerExch cpseReq;
//Create Clusterpoint XML query
ClusterpointServerMess cpseXMLQuery = new ClusterpointServerMess("search",StorageField.getText(),"name","pass",QueryField.getText());
//Do data exchange with Clusterpoint Server
cpseReq = new ClusterpointServerExch(cgiUrl,cpseXMLQuery.getMess());
cpseReq.doQuery();
String temp = cpseReq.getResponse();
//Parse out XML results
ClusterpointServerXMLParser cpseXMLResp = new ClusterpointServerXMLParser(temp.trim());
String[][] resArray = cpseXMLResp.parse();
String outp = "";
//Format output
for (int i = 0; i < cpseXMLResp.getResultLength(); i++) {
System.out.println("URL["+i+"]: "+resArray[i][1]+" Title: "+resArray[i][0]);
JLabel u;
outp += "Title: "+resArray[i][0]+"\n";
outp += "ID: "+resArray[i][1]+"\n\n";
}
ResultArea.setText(new String(outp));
}
}
}
import java.util.Calendar;
public class ClusterpointServerMess {
private String iComm;
private String iData;
private String iUser;
private String iPass;
private String iStorage;
private String message;
/** Creates new ClusterpointServerMess */
public ClusterpointServerMess(String command,String storage,String user, String passwd) {
iComm = command;
iData = null;
iUser = user;
iPass = passwd;
iStorage = storage;
message = ComposeMess();
}
public ClusterpointServerMess(String command,String storage, String user, String passwd,String data) {
iComm = command;
iData = data;
iUser = user;
iPass = passwd;
iStorage = storage;
message = ComposeMess();
}
public String getMess() {
return message;
}
private String ComposeMess() {
String mess = "";
long current = System.currentTimeMillis();
mess = "<cpse:request xmlns:cpse=\"www.clusterpoint.com\">";
// Format the current date and time (Calendar.YEAR etc. are field indexes, not date values)
mess += "<cpse:timestamp>"+new java.text.SimpleDateFormat("yyyy/MM/dd HH:mm:ss").format(new java.util.Date())+"</cpse:timestamp>";
mess += "<cpse:command>"+iComm+"</cpse:command>";
mess += "<cpse:requestid>"+current+"</cpse:requestid>";
mess += "<cpse:storage>"+iStorage+"</cpse:storage>";
mess += "<cpse:reply_charset>UTF-8</cpse:reply_charset>";
mess += "<cpse:user>"+iUser+"</cpse:user>";
mess += "<cpse:password>"+iPass+"</cpse:password>";
mess += "<cpse:application>JavaApi</cpse:application>";
mess += "<cpse:content>";
if ("search".equals(iComm)) {
mess += "<query>"+iData+"</query>";
mess += "<docs>10</docs>";
}
mess += "</cpse:content>";
mess += "</cpse:request>";
return mess;
}
}
import java.io.*;
import java.net.*;
import javax.swing.*;
public class ClusterpointServerExch {
private String iHost;
private String iData;
private String query;
private String response;
private String iFname;
/** Creates new ClusterpointServerExch */
public ClusterpointServerExch(String Host, String data) {
iData = data;
URL aURL=null;
try {
aURL = new URL(Host);
} catch (MalformedURLException e) {
System.out.println("Malformed URL");
}
iHost = aURL.getHost();
iFname = aURL.getFile();
}
public int doQuery() {
try {
Socket socket = new Socket(iHost,80);
// Write the body as UTF-8 so that it matches the Content-Length computed below
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "UTF-8"));
socket.setSoTimeout(60000); // set 1 minute timeout
String header = "POST "+iFname+" HTTP/1.0\r\n";
header += "Host: "+iHost+"\r\n";
header += "User-Agent: Clusterpoint Server Client Sample\r\n";
header += "Content-Length: " + iData.getBytes("UTF-8").length+"\r\n\r\n";
wr.write(header);
wr.write(iData);
wr.flush();
response = read_socket(socket);
wr.close();
socket.close();
} catch (UnknownHostException e) {
System.err.println("Exception: Unknown host " + iHost + "!");
System.exit(1);
} catch (IOException e) {
System.err.println("Exception: I/O error during connection!");
System.exit(1);
}
return 1;
}
public String getResponse() {
return response;
}
private String read_socket(Socket fsocket){
String reply="";
try{
// The request asks for a UTF-8 reply (reply_charset), so decode the response as UTF-8
BufferedReader rd= new BufferedReader(new InputStreamReader(fsocket.getInputStream(), "UTF-8"));
StringBuffer tempresp = new StringBuffer();
int ch=0;
while (true){
ch = rd.read();
if (ch < 0)
break;
else
tempresp.append((char)ch);
}
reply = new String(tempresp);
reply = reply.substring(reply.indexOf("\r\n\r\n"));
} catch (InterruptedIOException e){
return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";
} catch (UnknownHostException e){
return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";
} catch (IOException e){
return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";
} catch (NullPointerException e){
return "<cpse:reply>\n<cpse:error>\n<text>"+e.getMessage()+"</text><source>API</source><level>failed</level>\n</cpse:error>\n</cpse:reply>";
}
return reply;
}
}
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
public class ClusterpointServerXMLParser {
private String iData;
private String[][] Resultset = new String[10][2];
private int ResCount = 0;
/** Creates new ClusterpointServerXMLParser */
public ClusterpointServerXMLParser(String data) {
iData = data;
}
public String[][] parse() {
try {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(iData)));
NodeList nodes = doc.getElementsByTagName("document");
ResCount = nodes.getLength();
for (int i = 0; i < nodes.getLength(); i++) {
Element element = (Element) nodes.item(i);
NodeList title = element.getElementsByTagName("title");
Element line = (Element) title.item(0);
Resultset[i][0] = new String(getCharacterDataFromElement(line));
NodeList url = element.getElementsByTagName("id");
line = (Element) url.item(0);
Resultset[i][1] = new String(getCharacterDataFromElement(line));
}
} catch (Exception e) {
e.printStackTrace();
return null;
}
return Resultset;
}
public int getResultLength() {
return ResCount;
}
public String getCharacterDataFromElement(Element e) {
Node child = e.getFirstChild();
if (child instanceof CharacterData) {
CharacterData cd = (CharacterData) child;
return cd.getData();
}
return "?";
}
}
Error handling in the Clusterpoint Server system is simple: an XML 'error' tag is added to the reply for every error encountered.
Each error is returned in the Clusterpoint XML reply with an error code, an error text, an error severity and the subsystem in which the error was generated.
If a command sent to the Clusterpoint Server is not executed successfully, the error is returned in the following XML reply message:
<?xml version="1.0" encoding="REPLY-ENCODING"?>
<cpse:reply xmlns:cpse="www.clusterpoint.com">
<cpse:storage> Storage name </cpse:storage>
<cpse:timestamp> reply creation date and time </cpse:timestamp>
<cpse:command> command name for which the reply is created </cpse:command>
<cpse:requestid> message request id for which the reply is created </cpse:requestid>
<cpse:content> command specific data </cpse:content>
<cpse:seconds> time spent for the reply creation </cpse:seconds>
<cpse:error>
<code> error code (4 digits) </code>
<text> error textual message </text>
<level> error severity </level>
<source> subsystem in which the error occurred </source>
<document_id> document_id that the error refers to - for some errors </document_id>
</cpse:error>
</cpse:reply>
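As an illustration of reading this reply on the client side, the following is a minimal sketch that collects the code, text, level and source of every `<cpse:error>` element. The class name `ReplyErrors`, the sample reply and especially the `<source>` value are made up for the example; only the element names come from the reply format shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReplyErrors {
    // Made-up sample reply; the <source> value is only a placeholder.
    static final String SAMPLE =
        "<cpse:reply xmlns:cpse=\"www.clusterpoint.com\">"
      + "<cpse:storage>test</cpse:storage>"
      + "<cpse:command>search</cpse:command>"
      + "<cpse:error><code>2344</code><text>Storage not available.</text>"
      + "<level>Rejected</level><source>example-subsystem</source></cpse:error>"
      + "</cpse:reply>";

    /** Returns one {code, text, level, source} array per cpse:error element. */
    static List<String[]> parse(String replyXml) {
        List<String[]> errors = new ArrayList<>();
        try {
            // The default factory is not namespace-aware, so elements are
            // matched by their qualified name, prefix included.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(replyXml)));
            NodeList nodes = doc.getElementsByTagName("cpse:error");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                errors.add(new String[] {
                    child(e, "code"), child(e, "text"), child(e, "level"), child(e, "source")});
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Reply is not well-formed XML", ex);
        }
        return errors;
    }

    // Text of the first child element with the given name, or "" if it is absent.
    private static String child(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (String[] err : parse(SAMPLE)) {
            System.out.println("Error " + err[0] + " [" + err[2] + "] " + err[1]);
        }
    }
}
```

A reply without any `<cpse:error>` element yields an empty list, so the caller can treat an empty result as success.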
The error severity can be one of the following:
| Title | Description |
| Warning | Returned when the command is executed successfully, but there are some indications of problems. |
| Failed | Returned when the input data is incorrect. |
| Error | Returned when an error occurs during command execution. |
| Fatal | Returned when the system is not functioning. |
The purpose of the error severity is to tell the calling system how to react:
· If the error severity is fatal or error, the system work must be interrupted and the system administrator must be informed.
· If the error severity is failed or warning, the errors can be logged and analyzed, while the system work can continue.
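The two rules above could be encoded on the client side roughly as follows. This is a sketch only: the class and enum names are hypothetical, and the decision to report any other level (for example a level that is not described in this list) as unclassified rather than guessing a reaction is an assumption of the example.

```java
import java.util.Locale;

public class SeverityPolicy {
    enum Action { STOP_AND_ALERT, LOG_AND_CONTINUE, UNCLASSIFIED }

    /** Maps the text of the level tag to the reaction described above. */
    static Action actionFor(String level) {
        if (level == null) {
            return Action.UNCLASSIFIED;
        }
        switch (level.trim().toLowerCase(Locale.ROOT)) {
            case "fatal":
            case "error":
                // interrupt the work and inform the system administrator
                return Action.STOP_AND_ALERT;
            case "failed":
            case "warning":
                // log for later analysis, keep working
                return Action.LOG_AND_CONTINUE;
            default:
                // level not covered by the two rules; left to the caller
                return Action.UNCLASSIFIED;
        }
    }

    public static void main(String[] args) {
        for (String level : new String[] {"Fatal", "Error", "Failed", "Warning"}) {
            System.out.println(level + " -> " + actionFor(level));
        }
    }
}
```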
Clusterpoint Server is a transaction-based system, which means that commands have a predefined timeout period. If a command is not executed within this timeout period, the command returns an error.
It is possible to define a timeout period for
the request, or configure it for the Clusterpoint Server.
For more information on configuring timeout
periods for the Clusterpoint Server, see the Clusterpoint Server User Guide.
The following section contains a list of the most common error messages and their codes. Error messages can differ from the listed texts, and the list may be incomplete because new software versions are released from time to time. Please contact Clusterpoint for an updated list of Clusterpoint Server error messages.
The following table contains the most frequently occurring error codes. If you get an error code that is not found in this table and its meaning cannot be understood from the message, please contact technical support.
| Code | Level | Error message | Possible causes | Suggested solutions |
| 1212 | Fatal | Out of memory | Server is out of memory. | Stop unused storages and other processes. This will most likely cause data corruption. |
| 1419 | Error | Connection to node has failed. | Cluster node (or storage) is down. | Start stopped storages, check network connectivity. |
| 1517 | Fatal | I/O error. Unable to write to disk. | There are problems with the disk or the disk is full. | If the disk is full, delete unnecessary data. If the disk is not full, check for hardware errors. |
| 1518 | Fatal | I/O error. Unable to read from disk | There are problems with the disk. | Check for hardware errors. |
| 1520 | Error | Cannot write to temporary directory. | Clusterpoint Server process cannot write to the temporary directory. | Check the storage configuration, where the temporary directory is defined. Check permissions. |
| 1612 | Error | Invalid XML. | Clusterpoint Server got invalid XML in inter-process communications. | Contact technical support and report the bug. |
| 1619 | Rejected | Invalid XML | The request was invalid XML. | Fix the request and try again. |
| 1837 | Rejected | Invalid user name and/or password or permission denied. | An invalid user name and/or password was passed with the request, or the requested operation is not allowed. | Correct the user name and/or password or contact your administrator. |
| 1839 | Fatal | Invalid license - invalid signature. | The license file is corrupted or modified. | Obtain a valid license. |
| 1842 | Error | License has expired. | The time-limited license has expired. | Obtain a new license to continue using Clusterpoint Server. |
| 1844 | Fatal | Invalid license - wrong server ID. | The installed license file does not match the server. | Install the correct license for the server. |
| 1845 | Fatal | License not found. | The license is not installed. | Obtain and install a license. |
| 2231 | Error | Failed to exchange data with gateway. | Clusterpoint Server process is not available. | Start Clusterpoint Server. |
| 2344 | Rejected | Storage not available. | The request was sent to a storage that is not active or does not exist. | Start the stopped storage or fix the request. |
| 2417 | Rejected | Unknown command. | The request contains an invalid command name. | Fix the request and try again. |
| 2420 | Warning | Old namespace, please upgrade client library. | You are using a library that uses the old (1.x) request namespace. | Upgrade (or fix) the library. |
| 2421 | Rejected | Single command not found. | You are trying to execute a cluster command in a single node. | Check and fix your request. |
| 2422 | Rejected | Cluster command not found. | You are trying to execute a single command in a cluster node. | Check and fix your request. |
| 2425 | Rejected | Missing required parameter. | Your request is missing some mandatory parameters. | Check the documentation and fix your request. |
| 2428 | Failed | Node not available. | Clusterpoint Server tried to execute a command on a node that is not available. | Contact technical support. |
| 2430 | Error | Received zero length message. | Inter-process communication received an empty message. | This could mean server overload and a timeout in request processing. If it is not server overload, please contact technical support. |
| 2437 | Rejected | Document root tag is not present. | Your data modification request contains a wrong (according to policy) document root tag. | Check the policy and change the request. |
| 2445 | Rejected | Auto-increment feature is not available. | You are sending documents with autoincrement IDs, but the autoincrement feature is not available. | Restore the autoincrement feature using the command reset-autoincrement-status. |
| 2448 | Failed | Stripe not available, modifications denied. | At least one stripe node is not available. To preserve cluster integrity, modification requests are denied. | Restore the failed node or remove it from the configuration. |
| 2626 | Rejected | Duplicate primary key. | Your insert request contains documents with IDs that already exist in the storage. | Change the document IDs, use an update/replace request or delete the existing document first. |
| 2824 | Rejected | Requested document does not exist. | You are trying to retrieve a document that does not exist in the storage. | Change the requested document ID to an existing one. |
| 2826 | Rejected | Synchronization source not found. | You are trying to perform synchronization from an invalid source. | Check and fix the request 'from' parameter. |
| 2830 | Failed | Long modification process is running. | The storage is performing a long modification process, for example reindex or restore. | Try your request later. |
For different Clusterpoint Server versions the list of errors may differ slightly.
There are several groups of errors that are returned to the developer application and written to log files. Each error group has a range of 4-digit codes. The overall Clusterpoint Server system has the following groups of errors (see Table 1):
TABLE 1. Groups of errors and their code range in Clusterpoint Server software
| Code range | Errors group | Description of errors group |
| 1xxx | general errors | general errors which can be returned by any subsystem |
| 2xxx | database engine errors | core database engine errors |
| 3xxx | apps errors | Clusterpoint application subsystem errors |
| 4xxx | tools errors | Clusterpoint tools errors (pre-packaged applications) |
From the application developer's point of view, the most interesting errors are 'general errors' and 'database engine errors'. These errors are generated in the Clusterpoint Server core engine subsystems or the Web server gateway module, as responses to user application XML query messages. The application developer can process these errors according to the business logic of their own application.
There are also more detailed groupings of error types according to the problem area.
If you encounter an error code that is not present in the complete error list of this guide, use Table 2 to check which subgroup (and related problem area) the unlisted error could belong to. When looking for the cause of an unlisted error, always consult Table 2 for hints about what might be wrong.
TABLE 2. Error code ranges for specific problem areas
| Code range | Types of error | Description |
| 11xx | general problems | general problems |
| 12xx | memory problems | memory problems |
| 13xx | process problems | process problems |
| 14xx | network problems | network problems |
| 15xx | storage (disk) problems | storage (disk) problems |
| 16xx | xml parser problems | xml parser problems |
| 17xx | lock problems | lock problems |
| 18xx | users authorization problems | users authorization problems |
| 19xx | common | common errors |
| 21xx | general problems | general problems |
| 22xx | cpse | Core database server errors |
| 23xx | cps2-storage | Storage daemon errors |
| 24xx | cps2-filter | Filtered triggers subsystem errors |
| 25xx | cpse-gw | Clusterpoint Server Web server gateway module errors |
| 26xx | cps2-master | Clusterpoint Server master daemon errors |
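The range scheme of Table 1 and Table 2 can be applied mechanically to any 4-digit code, including codes that are not listed in this guide. The following is a sketch under that assumption; the class and method names are hypothetical, and the returned strings are copied from the two tables.

```java
public class ErrorCodeGroups {
    /** Errors group from Table 1, selected by the first digit of the code. */
    static String group(int code) {
        switch (code / 1000) {
            case 1: return "general errors";
            case 2: return "database engine errors";
            case 3: return "apps errors";
            case 4: return "tools errors";
            default: return "unknown group";
        }
    }

    /** Problem area from Table 2, selected by the first two digits of the code. */
    static String problemArea(int code) {
        switch (code / 100) {
            case 11: return "general problems";
            case 12: return "memory problems";
            case 13: return "process problems";
            case 14: return "network problems";
            case 15: return "storage (disk) problems";
            case 16: return "xml parser problems";
            case 17: return "lock problems";
            case 18: return "users authorization problems";
            case 19: return "common";
            case 21: return "general problems";
            case 22: return "cpse";
            case 23: return "cps2-storage";
            case 24: return "cps2-filter";
            case 25: return "cpse-gw";
            case 26: return "cps2-master";
            default: return "range not listed in Table 2";
        }
    }

    public static void main(String[] args) {
        int code = 1212; // "Out of memory" in the list of common errors
        System.out.println(code + ": " + group(code) + " / " + problemArea(code));
    }
}
```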
The complete list of all system errors and their codes is shown in the reference table below (Table 3).
TABLE 3. List of all software system errors in Clusterpoint Server
| Code | Error |
| 1111 | ERROR_DEBUG_NOTIFICATION |
| 1112 | ERROR_BUS_ERROR |
| 1113 | ERROR_SYSTEM_BUSY |
| 1114 | ERROR_SYSTEM_DOWN |
| 1115 | ERROR_INVALID_RESPONSE |
| 1116 | ERROR_INVALID_TAG |
| 1211 | ERROR_MEMORY_CORRUPTION |
| 1211 | ERROR_SEGMENTATION_FAULT |
| 1212 | ERROR_OUT_OF_MEMORY |
| 1213 | ERROR_MEMORY_MAPPING |
| 1214 | ERROR_MEMORY_ALLOCATION |
| 1311 | ERROR_CREATE_THREAD |
| 1411 | ERROR_DAEMON_START |
| 1412 | ERROR_DAEMON_RECEIVE |
| 1413 | ERROR_DAEMON_SEND |
| 1414 | ERROR_DAEMON_EXCHANGE |
| 1415 | ERROR_DAEMON_SOCKET |
| 1511 | ERROR_DISK_IO |
| 1512 | ERROR_PERMISSION |
| 1611 | ERROR_XML_PARSING |
| 1612 | ERROR_BAD_XML |
| 1613 | ERROR_XML_MANIPULATION |
| 1614 | ERROR_XML_DUMP |
| 1615 | ERROR_XML_CORRUPTED |
| 1711 | ERROR_MODIFY_LOCK_TIMEOUT |
| 1712 | ERROR_ACCESS_LOCK_FULL |
| 1713 | ERROR_ACCESS_LOCK_TIMEOUT |
| 1714 | ERROR_MODIFY_RELEASE |
| 1715 | ERROR_ACCESS_RELEASE |
| 1811 | ERROR_AUTH_MODULES |
| 1812 | ERROR_AUTH_HANDLER |
| 1813 | ERROR_AUTH_SKIP |
| 1814 | ERROR_AUTH_ENUMERATESTOP |
| 1815 | ERROR_AUTH_NOTSUPPORTED |
| 1816 | ERROR_AUTH_PARAMETERS |
| 1817 | ERROR_AUTH_FAILED |
| 1818 | ERROR_AUTH_USEREXISTS |
| 1819 | ERROR_AUTH_NOUSER |
| 1820 | ERROR_AUTH_GROUPEXISTS |
| 1821 | ERROR_AUTH_NOGROUP |
| 1822 | ERROR_AUTH_INTERNAL |
| 1823 | ERROR_AUTH_CONFLICT |
| 1824 | ERROR_AUTH_CIRCULAR |
| 1825 | ERROR_AUTH_UNAVAILABLE |
| 1826 | ERROR_AUTH_DENIED |
| 1827 | ERROR_AUTH_DUPLICATE |
| 1911 | ERROR_QUEUE_DUMP |
| 1912 | ERROR_QUEUE_PUSH |
| 1913 | ERROR_TABLE_NOTIFICATION |
| 1914 | ERROR_TABLE_STRUCTURE |
| 1915 | ERROR_TABLE_READ |
| 1916 | ERROR_TABLE_CLOSE |
| 1917 | ERROR_TABLE_INTEGRITY |
| 1918 | ERROR_TABLE_CONFIRM |
| 1919 | ERROR_TABLE_ADVANCE |
| 1920 | ERROR_TABLE_PUT |
| 1921 | ERROR_TABLE_WRITE |
| 1922 | ERROR_TABLE_SYNCHRONIZE |
| 1923 | ERROR_TABLE_OPEN |
| 1924 | ERROR_TABLE_TRANSFER |
| 2111 | ERROR_STORAGE_UNAVAILABLE |
| 2112 | ERROR_STORAGE_NOT_FOUND |
| 2113 | ERROR_MISSING_REQUESTID |
| 2114 | ERROR_WRONG_STORAGE |
| 2115 | ERROR_AUTHORIZATION_USER |
| 2116 | ERROR_AUTHORIZATION_IP |
| 2117 | ERROR_AUTHORIZATION_PASS |
| 2118 | ERROR_AUTHORIZATION_UNKNOWN |
| 2119 | ERROR_INVALID_COMMAND |
| 2211 | ERROR_CPSE_CONFIGURATION |
| 2311 | ERROR_POSSIBLE_CORRUPTION |
| 2312 | ERROR_DATA_CORRUPTION |
| 2313 | ERROR_DUPLICATE_KEY |
| 2314 | ERROR_KEY_NOT_FOUND |
| 2315 | ERROR_DOCUMENT_UNAVAILABLE |
| 2411 | ERROR_FILTER_COMMAND |
| 2412 | ERROR_FILTER_CONFIGURATION |
| 2413 | ERROR_FILTER_STRUCTURE |
| 2414 | ERROR_FILTER_MESSAGE |
| 2415 | ERROR_FILTER_ADD |
| 2416 | ERROR_FILTER_GET |
| 2511 | ERROR_GW_PRECONDITION |
| 2512 | ERROR_GW_PARAMETER |
| 2513 | ERROR_GW_TEMPLATE |
| 2514 | ERROR_GW_REQUEST |
| 2515 | ERROR_GW_XSLT |
| 2516 | ERROR_GW_CONFIGURATION |
| 2517 | ERROR_GW_CONVERT |
| 2518 | ERROR_GW_STORAGE |
| 2519 | ERROR_GW_COMMAND |
| 2520 | ERROR_GW_DISPATCH |
| 2611 | ERROR_IDX_OVERLOAD |
| 2612 | ERROR_IDX_EXCEPTION |
| 2613 | ERROR_IDX_3RD_CACHE |
| 2614 | ERROR_IDX_CANCEL |
| 2615 | ERROR_IDX_STATUS |
| 2616 | ERROR_IDX_INTEGRITY |
| 2617 | ERROR_IDX_READ |
| 2618 | ERROR_IDX_POOL_STATE |
| 2619 | ERROR_IDX_POOL_SAVE |
| 2620 | ERROR_IDX_HASH |
| 2711 | ERROR_MGR_CONFIGURATION |
| 2712 | ERROR_MGR_NOTIFICATION |
| 2713 | ERROR_MGR_REQUEST_TYPE |
| 2714 | ERROR_MGR_NO_LOG |
| 2715 | ERROR_EMPTY_CLUSTER |
| 2716 | ERROR_TEMPLATE_PARSING |
| 2717 | ERROR_NODE_UNAVAILABLE |
| 2718 | ERROR_NODE_NETWORK |
| 2719 | ERROR_NODE_ERROR |
| 2720 | ERROR_NODE_RESPONSE |
| 2721 | ERROR_NO_RESULT |
| 2722 | ERROR_MGR_SIM_WORDS |
| 2723 | ERROR_MGR_SIM_RESPONSE |
| 2724 | ERROR_MGR_SIM_DOCUMENT |
| 2725 | ERROR_MGR_SIM_PARAMETER |
| 2726 | ERROR_MGR_REINDEX |
| 2811 | ERROR_PARSER_TIMEOUT |
| 2812 | ERROR_ALTERNATIVES |
| 2813 | ERROR_VOC_CLOSE |
| 2814 | ERROR_VOC_INIT |
| 2815 | ERROR_WILDCARD |
Problems that occur when working with Clusterpoint
Server can be reported to Clusterpoint technical support.
To report a problem, provide the following to technical support:
· name of your company
· Clusterpoint DBMS license number
· version of the Clusterpoint DBMS software
· relevant Clusterpoint Server log file items
For contact information on technical support,
see Contact Information.
This section contains the following frequently asked questions:
· How can I import binary data to Clusterpoint Server?
· How can I make Clusterpoint Server automatically ignore common words when performing FTS?
· How can I see the actual query that is used for FTS?
· How can I export the vocabulary with a Clusterpoint API command?
· Why do I get an error: connection failed when importing data to the Clusterpoint Server from a Windows NT 4.0 or Windows 2000 environment?
· Is it possible to return more than 1000 documents to the result set?
How can I import binary data to Clusterpoint Server?
To import binary data, such as MS Word or PDF document files, to the Clusterpoint storage, enter the data in the info document part.
Note: Data in the info part is not available for FTS. If you want your data to be available for FTS, it must be stored as plain text.
Binary data usually does not comply with the XML formatting standard, but it must comply to be imported to the Clusterpoint storage. Therefore, before importing to the Clusterpoint storage, you must encode the binary data using base64 or another XML-safe encoding.
For more information on document parts, see Understanding Clusterpoint Server Document Structure.
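The base64 step mentioned above can be done with the standard Java library. The sketch below only shows the encoding and decoding of a byte array; the class name is hypothetical, the sample bytes are made up, and how the encoded text is placed into the info part of a document is not shown here.

```java
import java.util.Arrays;
import java.util.Base64;

public class BinaryToBase64 {
    /** Encodes raw bytes into base64 text, which is safe to embed in XML. */
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Restores the original bytes from the base64 text. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Made-up sample: bytes that are not valid XML text as they stand
        // (a NUL byte, 0xFF, and the markup characters '<' and '&').
        byte[] raw = {0x25, 0x50, 0x44, 0x46, 0x00, (byte) 0xFF, 0x3C, 0x26};
        String encoded = encode(raw);
        System.out.println("encoded: " + encoded);
        System.out.println("round trip ok: " + Arrays.equals(raw, decode(encoded)));
    }
}
```

Base64 output uses only letters, digits, `+`, `/` and `=`, so it needs no further XML escaping; the cost is that the encoded text is about one third larger than the binary data.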
How can I make Clusterpoint Server automatically ignore common words when performing FTS?
The Clusterpoint Server can be configured to ignore, at search time, the words that appear most often in the Clusterpoint storage, using a customer-supplied ignored words list. These words are considered to be common words and are ignored during FTS for performance reasons.
It is possible to edit the ignored words list.
For more information on ignored words, see Ignored Words.
How can I see the actual query that is used for FTS?
Often the actual query that is used for FTS differs from the one you entered as a search query. The reasons for this can be the following:
· If the original query contains words from the ignored words list, they are dropped from the actual search query.
· If the original query contains wildcard patterns, a class of words created from the wildcard pattern usage is entered in the actual search query.
To see the actual query used for FTS, use the <real_query> tag of the XML reply to the search command.
For more information on the search command, see Search.
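Reading the `<real_query>` tag from a search reply could look like the sketch below. The class name and the sample reply are made up for illustration (the sample imagines a query whose first word was on the ignored words list); only the `real_query` tag name and the `cpse:reply` / `cpse:content` nesting are taken from this guide.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RealQuery {
    // Made-up reply to a search for "the clusterpoint server".
    static final String SAMPLE =
        "<cpse:reply xmlns:cpse=\"www.clusterpoint.com\">"
      + "<cpse:content>"
      + "<real_query>clusterpoint server</real_query>"
      + "<hits>0</hits>"
      + "</cpse:content>"
      + "</cpse:reply>";

    /** Returns the text of the first real_query tag, or "" if the reply has none. */
    static String realQuery(String replyXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(replyXml)));
            NodeList nodes = doc.getElementsByTagName("real_query");
            return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Reply is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println("Query actually used: " + realQuery(SAMPLE));
    }
}
```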
How can I export the vocabulary with a Clusterpoint API command?
The vocabulary is a list of all unique words in the Clusterpoint storage. Unique words are found in documents and added to the vocabulary while these documents are stored in the Clusterpoint storage. Each Clusterpoint storage has its own vocabulary.
It is not possible to export the vocabulary with any of the Clusterpoint API commands. However, at the file level the vocabulary is stored in a text file in which each line contains one word. You can copy this text file and view it.
For information on the vocabulary text file, see the Clusterpoint Server User Guide.
For more information on vocabulary, see Understanding Storing Information in Clusterpoint Server.
Why do I get the error "connection failed" when importing data to the Clusterpoint
Server from a Windows NT 4.0 or Windows 2000 environment?
Importing data to the Clusterpoint Server, just
like any other operation with the Clusterpoint Server, is performed by
transporting XML requests and replies via HTTP.
When a large amount of data is imported to the Clusterpoint Server, many TCP/IP connections are opened. After the connections are closed, they remain in the TIME_WAIT state for a fixed time period.
By default, in the Windows NT 4.0 or Windows 2000 environment, the limit of connections is fairly small and the TIME_WAIT period is long.
Because the number of new connections created per second can be very large and closed connections remain in the TIME_WAIT state for some time, the number of connections can reach the limit very quickly.
In that case, the system does not allow a new connection to be created and the error is returned.
To configure the limit of the connections and
the TIME_WAIT state time period, configure the following key in the Windows NT
4.0 or Windows 2000 registry:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"TcpTimedWaitDelay"=dword:00000015
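where the value is the TIME_WAIT period in seconds. In this registry file notation the dword is written in hexadecimal, so 00000015 corresponds to 21 seconds.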
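The arithmetic behind this answer can be sketched in a few lines of Python. The helper names are hypothetical and the figures in the usage lines (50 connections per second, a 240-second and a 30-second delay) are illustrative assumptions, not measured values. The sketch also shows how a number of seconds maps to the hexadecimal dword notation used in registry files.

```python
def time_wait_sockets(connections_per_second: float, time_wait_seconds: int) -> float:
    """Approximate steady-state count of closed connections still in TIME_WAIT."""
    return connections_per_second * time_wait_seconds


def reg_dword(value: int) -> str:
    """Format an integer the way registry files write a dword (8 hex digits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a dword must fit in 32 bits")
    return f"dword:{value:08x}"


def tcp_timed_wait_delay_line(seconds: int) -> str:
    """Build the registry file line that sets TcpTimedWaitDelay."""
    return f'"TcpTimedWaitDelay"={reg_dword(seconds)}'


# Assumed import rate of 50 new connections per second.
lingering_long = time_wait_sockets(50, 240)   # long delay: many lingering sockets
lingering_short = time_wait_sockets(50, 30)   # shorter delay: far fewer
line = tcp_timed_wait_delay_line(30)
```

Reusing one HTTP connection for many requests, where the client library allows it, reduces the number of connections opened in the first place.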
Is it possible to return more than 1000 documents to the result set?
By default, at most 1000 documents are returned to the result set. It is possible to increase this limit.
However, the following functions are designed for a result set of no more than 1000 documents:
· sorting search results by relevance
· grouping search results by a group
If you increase the limit of documents in the result set, the new limit applies to all functions except sorting by relevance and grouping: when search results are sorted by relevance or grouped, only 1000 documents are returned to the result set.
Increasing the limit also means that transactions in the Clusterpoint Server take longer to complete. Therefore, you should also increase the timeout period of functions.
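The rule above can be written down as a small Python function. This is only a model of the behaviour described in this answer, with hypothetical names; it is not part of the Clusterpoint API.

```python
# Cap that applies when results are sorted by relevance or grouped,
# as stated in the answer above.
RELEVANCE_GROUPING_CAP = 1000


def effective_result_limit(configured_limit: int,
                           sort_by_relevance: bool = False,
                           group_results: bool = False) -> int:
    """Return how many documents a search can return under the stated rule."""
    if configured_limit < 1:
        raise ValueError("the configured limit must be a positive integer")
    if sort_by_relevance or group_results:
        return min(configured_limit, RELEVANCE_GROUPING_CAP)
    return configured_limit


plain_search = effective_result_limit(5000)
ranked_search = effective_result_limit(5000, sort_by_relevance=True)
```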
For more information on the relevance, see Relevance.
For more information on grouping documents by a
group, see Search.
When I import a large amount of data to the Clusterpoint storage, why are
the data not available for FTS for a while?
When importing data to the Clusterpoint storage, if the memory reserved for the memory cache is not sufficient for the amount of data being imported, then:
1. The data being imported are written to another cache, which is written to the disk, and the index state is expanding.
2. When the import is complete, the Clusterpoint Server commits the data written on the disk to the Clusterpoint Index, and the index state is collapsing.
While the index state is expanding or collapsing, the data written to the disk are not available for FTS. The data become available for FTS only after they have been added to the Clusterpoint Index.
For example, if tens of GB of data are imported, these data will not be available for FTS for a few hours.
For more information on the index state, see Status.
This appendix describes the parameters that set default configuration options for each Clusterpoint storage.
In addition to configuring the Storage Document policy, which directly controls how document contents are handled during indexing and search, some additional configuration options can also be set for each storage.
These options are configured by using the Clusterpoint Manager or by editing the XML configuration file ‘config.xml’, located in the directory with the same name as the storage, in any text editor.
The default options are shown in the table below. Additional options may be available if required.
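As a rough illustration of how a full path such as /config/workers could map onto config.xml, the Python sketch below treats each path segment as a nested XML element. That mapping is an assumption inferred from the path notation and from the specsymbols example in this appendix; the helper names set_option and get_option are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_option(root: ET.Element, full_path: str, value) -> None:
    """Set the option at a path like '/config/index/specsymbols'."""
    parts = full_path.strip("/").split("/")
    if parts[0] != root.tag:
        raise ValueError(f"path must start with /{root.tag}")
    node = root
    for name in parts[1:]:
        child = node.find(name)
        if child is None:            # create missing intermediate elements
            child = ET.SubElement(node, name)
        node = child
    node.text = str(value)


def get_option(root: ET.Element, full_path: str, default=None):
    """Read the option at the given path, or return default if it is absent."""
    parts = full_path.strip("/").split("/")
    if parts[0] != root.tag:
        return default
    node = root
    for name in parts[1:]:
        node = node.find(name)
        if node is None:
            return default
    return node.text


config = ET.Element("config")
set_option(config, "/config/workers", 8)
set_option(config, "/config/repository/highlight/open_mark", "<em>")
set_option(config, "/config/repository/highlight/close_mark", "</em>")
set_option(config, "/config/index/specsymbols", "#@&")
xml_text = ET.tostring(config, encoding="unicode")
```

Writing values through an XML library rather than by string concatenation means that characters such as < and & in the highlight marks or the specsymbols list are escaped correctly in the file.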
| Configuration option | Full path | Possible values | Description |
| General configuration |||
| Temporary directory | /config/tmpdir | A valid filesystem path (default is "/tmp") | Sets the temporary directory for Clusterpoint Server. Backups are expanded here during restore. |
| Backup directory | /config/backup_dir | A valid filesystem path | Sets the directory where backup files are placed. |
| Authorization check | /config/authorization | yes/no (default no is not implied; default yes) | Enables or disables the authorization check when executing any request. |
| Debug information | /config/debug | yes/no (default no) | Turns the printing of debug information in log files on or off. |
| Worker pool size | /config/workers | Positive integer (default 5) | Sets the default number of worker threads that serve the request queue. |
| Autostart flag | /config/bootable | yes/no (default no) | Enables or disables starting the storage when the server starts. |
| Storage description | /config/description | String | Storage description shown in Clusterpoint Manager. |
| Repository configuration |||
| Highlight open mark | /config/repository/highlight/open_mark | Any characters (default is "<b>") | Sets the character or character sequence that begins a highlighted word. |
| Highlight close mark | /config/repository/highlight/close_mark | Any characters (default is "</b>") | Sets the character or character sequence that ends a highlighted word. |
| Bwd_chars | /config/repository/snippet/bwd_chars | Positive integer (default is 150) | When creating a snippet, the snippet begins with a word up to bwd_chars characters before the position in the text that corresponds best to the query. |
| P_words | /config/repository/snippet/p_words | Positive integer (default is 1) | The snippet will have no more than p_words + q_words*(number of words in the query) words corresponding to the query. |
| Q_words | /config/repository/snippet/q_words | Positive integer (default is 2) | See P_words. |
| Fwd_chars | /config/repository/snippet/fwd_chars | Positive integer (default is 250) | The snippet ends if the next word in the document that corresponds to the query is more than fwd_chars characters or fwd_punct punctuation marks distant from the previous word that corresponds to the query. |
| Fwd_punct | /config/repository/snippet/fwd_punct | Positive integer (default is 200) | See Fwd_chars. |
| Max_size | /config/repository/snippet/max_size | Positive integer (default is 2500) | The snippet will not be longer than max_size characters. If the snippet, according to the rest of the parameters, would be much shorter than bwd_chars, another snippet is created and appended to the previous one; such snippets are separated by "...". |
| Bucket size | /config/repository/bucket_size | Positive integer, in bytes (default is 134217728) | Default size of a bucket in the bucket pool where data are stored. |
| Bucket saturation | /config/repository/bucket_compact_saturation | Positive floating-point number in the range 0.0-1.0 (default is 0.5) | Threshold at which the compacting mechanism starts to defragment buckets. |
| Index configuration |||
| Memory pool size | /config/index/memory_pool_size | Positive integer, in MB (default is 100) | Sets the size of the RAM area used by each index (there might be several if stemming is used) for documents that have been inserted but not yet saved to disk. |
| Specsymbols | /config/index/specsymbols | Character sequence (default '_') | Sets the list of characters treated as parts of words. By default, all non-alphanumeric characters are treated as separators. For example, the email address "nobody@nowhere.com" in a text is split into 3 words: "nobody", "nowhere", "com". If the @ symbol is set as a specsymbol, the parser splits it into 2 parts: "nobody@nowhere" and "com". Example of multiple characters: <specsymbols>#@&</specsymbols> |
| Tag colocation distance | /config/index/colocation_distance | Positive integer (default is 10000) | Step for shifting word positions to make the colocation feature work. This parameter should be increased if the word count in colocated tags exceeds this value. |
| Policy in document | /config/index/parse_indoc_policy | yes/no (default no) | Enables or disables policy parsing for documents that contain their own policy. |
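The effect of the specsymbols option on word splitting can be sketched in Python as follows. This is a simplified model of the behaviour described for /config/index/specsymbols, not the server's actual parser: it treats every character that is neither alphanumeric nor listed in specsymbols as a separator, and the function name split_words is hypothetical.

```python
def split_words(text: str, specsymbols: str = "_") -> list:
    """Split text into words, keeping specsymbols characters inside words."""
    words = []
    current = []
    for ch in text:
        if ch.isalnum() or ch in specsymbols:
            current.append(ch)              # character belongs to the current word
        elif current:
            words.append("".join(current))  # separator ends the current word
            current = []
    if current:
        words.append("".join(current))
    return words


default_split = split_words("nobody@nowhere.com")
# ['nobody', 'nowhere', 'com']
at_split = split_words("nobody@nowhere.com", specsymbols="@")
# ['nobody@nowhere', 'com']
multi_split = split_words("R&D dept #42", specsymbols="#@&")
# ['R&D', 'dept', '#42']
```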
| A | |
| API | Application programming interface. |
| B | |
| Boolean expression | Expression containing logical operators like AND, OR, and NOT. |
| C | |
| Clusterpoint API | Standardized set of functions for accessing the Clusterpoint Server. See also: Clusterpoint Server. |
| Clusterpoint Server document | Unit in Clusterpoint storage against which searching is performed. It can be unstructured or XML structured. See also: Clusterpoint storage. |
| Clusterpoint Server | Stand-alone server for storing and retrieving information such as plain texts or XML structured documents. It can be run in one or more instances per computer. |
| Clusterpoint storage | Data collection for storing Clusterpoint Server documents in a format that ensures a search is performed very fast. Clusterpoint storage is contained by one Clusterpoint Server instance, and consists of vocabulary, document repository, and Clusterpoint Index. See also: Clusterpoint Server document, vocabulary, document repository, Clusterpoint Index. |
| | Feature that allows distinguishing between uppercase and lowercase characters. |
| D | |
| demon | Program or process, part of a larger program or process, that is dormant until a certain condition occurs and then is initiated to do its processing. |
| document repository | Place where all Clusterpoint Server documents are stored in the format in which they were submitted to the Clusterpoint Server system, so that the documents can be returned on a search request. See also: Clusterpoint Server document. |
| F | |
| FTS | Full text search. See also: FTS query. |
| FTS query | Full text search request to the Clusterpoint Server. See also: FTS. |
| fuzzy search | Feature that allows searching for words that sound the same but are spelled differently. |
| I | |
| | List of words, where each word has a list of pointers to Clusterpoint Server documents in which the word occurs. See also: Clusterpoint Server document. |
| M | |
| markup search | Feature that allows searching for words within specific markup. |
| | Feature that ensures that documents in different languages and character encodings can be stored and searched within one Clusterpoint storage. See also: Clusterpoint storage. |
| P | |
| | Set of operations specific to document structure parts: rules for importing data to and retrieving data from the Clusterpoint storage, special indexing methods, relevance weight rules, rules for listing in search results, results grouping rules, etc. See also: Clusterpoint storage. |
| R | |
| RAM | Random access memory. |
| rate | Mechanism that ensures that results are ordered in a result set according to an assigned, unchanging rate. |
| relevance | Measure of the accuracy of the search results. See also: rate. |
| S | |
| snippet | Fragment with an occurrence of a search term. |
| specific weight | Weight of a word in a document that is calculated according to the specific weight interval of the document part in which the word occurs. See also: specific weight interval. |
| specific weight interval | Integer range assigned to a document part that denotes the importance of the document part compared to other document parts. See also: specific weight. |
| stemming | Feature that allows searching for words and their inflected forms. |
| U | |
| UTF | UCS (universal character set) transformation format. |
| V | |
| vocabulary | List of all unique words in the Clusterpoint storage. Unique words are found in documents and added to the vocabulary while storing these documents to the Clusterpoint storage. See also: Clusterpoint storage. |
| W | |
| weight | Relative integer from 0 to 100 that defines relevance and is used to rank XML document parts against each other. |
| wildcard search | Feature that allows searching for unknown characters or phrases. |
| X | |
| XML | Extensible markup language. |
| XML reply | XML message that is returned when submitting an XML request. See also: XML request. |
| XML request | XML message that is sent to the Clusterpoint Server to perform a Clusterpoint API command. See also: Clusterpoint API. |
[1] A demon is a program or process, part of a larger program or process, that is dormant until a certain condition occurs and then is initiated to do its processing.