Select .. Where


When you query documents in Clusterpoint database, the criteria for matching documents can be passed in the WHERE clause as an arbitrary JavaScript expression. A JavaScript expression is evaluated in the context of a document. Thus fields of the document are bound as local variables in the execution context. Note that instead of accessing document and its fields directly Clusterpoint uses indices that are built when document is inserted.

Documents that match the specified criteria will form the result set, you will learn about ordering and aggregation options in the subsequent sections.

Equality and Comparisons

Clusterpoint builds indices for the following field types: numeric, string, date. Since the WHERE clause is JavaScript expression you can use == operator to compare field to a specific value, such as:

SELECT * 
FROM product
WHERE name == "Super Bicycle"

or you can use >, <, >=, <= comparison operators

SELECT * FROM product
WHERE price > 300.00


If collection has documents with specific field, that is present not in each document, then by using "typeof" operator it is possible to select such documents containing certain field:

SELECT * FROM collection
WHERE typeof(field_name) != "undefined"

Fields containing Date values

The Clusterpoint v4 database can handle date-time information, database has special index for Date data type - "datetime". Upon document insertion every field that is a string gets examined for whether it looks like a date-time info. This can be turned off by adding a remove-type override for the data type DATETIME.

The typical date-time information in ISO8601 extended format looks like:
'2015-12-25T11:30:05+02:00' or '+2015-12-25T09:30:05.012345Z'.

You can use datetime functions to operate with datetime data in your database.

Here is an example of SELECT query with datetime function "YEAR":

SELECT username, reg_date, YEAR(reg_date)
FROM collection
WHERE username.CONTAINS("Kate")
You can find more detailed information about date/time data type in the article Date and Time.

Boolean Expressions

Since WHERE clause takes any JavaScript expressions, comparisons and other logical operators can be combined with &&, ||, ! and combined using (), such as:

SELECT * FROM product
WHERE (category == "Bicycles" && price >= 100) || price > 400

Arithmetic, String & Other Operations

Since WHERE accepts any JavaScript expression, the real power of JS/SQL comes from use of arbitrarily complex expressions. Arithmetic operators, mathematical functions and all JavaScript methods for build in types are readily available.

SELECT * FROM product
WHERE name.split(" ").length < 4 && order_price > 180

Functions

Ultimately JavaScript expressions can get complicated and instead of writing them inline it might be helpful to define a function. The JS/SQL model allows to prepend statements with function definitions or any other JavaScript code that initialized context before executing expression in the WHERE clause for every document.

function discount(category) {
discounts = {
"Bicycles" : 0.10,
"Bike": 0.25}
if (category in discounts) {
return discounts[category]
}
return 0;
}
SELECT category, price, discount(category), price * discount(category), price - price * discount(category) as final_price FROM product
WHERE price * discount(category) < 50

Also note that function is called form the context of each the document in which WHERE clause will be executed, thus function will be able to access field of the document, which are defined as variables in the outer scope, so the following would work as well:

function discount() {
// category is a field of the document, thus defined in the outer context
if (category == "Bike") return 0.25;
// price as well is a field of the document
// thus defined in the outer context and accessible from the function
if (price > 200.0) return 0.10;
return 0.0;
}
SELECT category, price, discount(), price * discount(), price - price * discount(category) as final_price FROM product
WHERE price * discount() < 50

Full Text Search

Full text search is an important part of Clusterpoint technology, since it can create real-time immediatelly consistent full text index right in the operational collection.

To select documents matching  documents against a specific full text query use CONTAINS function in the WHERE clause.

Note: FTS works only on those fields, that are indexed with "Full Text" (Text Search = ON in Data Model).

It can be used in the context of entire document, thus matching against any string field with full-text search index:

SELECT * FROM product
WHERE CONTAINS("Mountain Bike")

Or the same CONTAINS function can be applied to a particular field:

SELECT * FROM product
WHERE name.CONTAINS("Mountain Bike")

The basic rules of composing a full text search query are similar to those commonly used in search engines.

Single Term

The following returns true for every document containing word Bike:

CONTAINS("Bike")

Multiple Terms

To match documents against several terms those can be just separated with spaces. The following returns true for every document containing words Mountain and Bike anywhere in the document:

CONTAINS("Mountain Bike")

Phrase

If terms have to make up a specific phrase, they have to be enclosed in quotes. The following returns true for every document containing words Mountain and Bike in exact order:

CONTAINS("\"Mountain Bike\"")

Now there are more advanced full text search features like wildcards, proximity search etc:

Wildcard Patterns

To search for documents that contain a class of words, wildcards can be used to substitute:

  • exactly one unknown character with a question mark ("?")
  • zero or more unknown characters with an asterisk ("*")
  • a range of definite characters for one unknown character occurrence using square brackets ("[ ]")

Note: When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for. This limitation is introduced to preserve the high performance of the Clusterpoint. The extent of this limitation can be modified by changing collection configuration.

Examples:

The query returns documents that contain the words "map", "may", "mat", "max", and so on:

CONTAINS("ma?")

The query returns documents that contain the words "George", "Geotermal", "Geology", and so on:

CONTAINS("Geo*")

The query returns documents that contain the words "map" or "may":

CONTAINS("ma[py]")

The query returns documents that contain the words "counter", "club", "chapter", "country", "change", "chat", "council", "class", "cpu", "challenge", "church", "couple", "championship", and so on:

CONTAINS("c?[au]*")

Specsymbols

All non-alphanumeric symbols are considered as "separators" in Clusterpoint database. If you want to search by some of spec-symbols, it is necessary:
1. to define specsymbol in collection configuration;
2. to "escape" specsymbol in SELECT ...WHERE query.

To define specsymbol in database configuration, open database "Configure" section and choose tab "Configuration". Add necessary special characters to configuration, example:

<config>
   <index>
      <specsymbols>,;.&</specsymbols>
   </index>
</config>

Do not separate characters with "space", because in that way "space" will also considered as spec-symbol.
After "Configuration" is changed and saved, you need to Reindex database.

Escape specsymbols in WHERE clause. In example below character "," is escaped:

SELECT * FROM product
WHERE name.CONTAINS("Mountain\, Bike")

Ignored Search Terms

By default, Clusterpoint ignores common words and characters such as "and", "where", and "how", as well as certain single characters (letters), because they tend to slow down the search without improving search results. Common words and characters like this are called ignored words.

The Clusterpoint Server detects words that appear in the Database most often and adds them to the ignored words list. The extent of this limitation can be modified either by changing the configuration.

Ignored words are applied only for Full Text search. If Clusterpoint detects Ignored words in search request, then the response will contain information about Ignored words in field "problem_terms".

If a common word or a character is essential to getting the required results, it can be included by preceding it with a plus sign "+".

CONTAINS("George +and Mary")

The query returns documents that contain all three words: "George", "and", and "Mary".

Boosted Search Terms

It is possible to alter the weight values for parts of the query, in order to affect the relative importance of different query terms while sorting documents by relevance. For instance, when searching for George Lucas movie it might be beneficial to consider terms George Lucas to be more important for relevance than the term movie. This can be achieved with the weight boosting feature.

The weight boosting operator (^) alters weights for each term that are placed between the opening and closing ^ symbols. In order to have more fine-grained control over particular weight modifications, 3 different modes of operation are provided: incrementation, replacement, or multiplication.

With boosted operator you should define ordering by relevance ascending/descending:

CONTAINS("^boosted word^ search words")
ORDER BY RELEVANCE() DESC

Incrementation

The simplest way to increment the terms' weight by 10 is to surround them with the ^ operator:

CONTAINS("^George Lucas^ movie")

You have to be careful with this syntax and user-submitted queries, because if there are numbers and/or special symbols after the first ^ symbol, they might be interpreted as additional weight boosting operator parameters, as specified below.

Another way to specify the same operation is this:

CONTAINS("^+10 George Lucas^ movie")

This will add 10 to the weight value of every occurrence of "George or Lucas". For instance, if the original indexed weight value was 50, it will be incremented to 60.

Note that the increment can also be negative, so this is also valid:

CONTAINS("George Lucas ^-10 movie ^")

Replacement

CONTAINS("^100 George Lucas ^ movie")

This will replace the weight value of every occurrence of "George or Lucas" with 100. For instance, if the original indexed weight value was 50, it will be replaced with 100.

Multiplication

CONTAINS("^*1.5 George Lucas ^ movie")

This will multiply the weight value of every occurrence of "George or Lucas" by 1.5. For instance, if the original indexed weight value was 50, it will be replaced with 75.

Stemming

It is possible to include a word and its declinations, for example, "go" and "going", in one search request. This feature is especially useful for so-called synthetic languages, in which syntactic relations within sentences are expressed by the change in the form of a word that indicates distinctions of tense, person, gender, number, mood, voice, and case, for example, German, Russian and Latin.

In order for the feature to work correctly, you have to:

  • choose "Text Search" and specify stem-lang rule in Data model for the field that you are searching on. Possible values for the stem-lang parameter are listed in Data Model. It is possible to index several stemming variants;
  • use stemming operators "$" for requested parts in search query, that should be stemmed (to search for documents that contains a word or its declinations);
  • specify the stem-lang parameter in stem search request.
Example:

SELECT * FROM product
WHERE name.CONTAINS("$$en Mountain Bike$")

Stemming and collation for Latvian and Lithuanian languages

Clusterpoint supports possibility to ignore specific language letters in search requests (ā -> a; ģ -> g etc.):
lv/lvcs/lvco – for Latvian language
lt/ltcs/ltco – for Lithuanian language

Where:
c/co = collation (ignores specific letters)
s = stemming (includes word declinations)


There are available following stemming languages:
1) lv/lt - includes a word and its declinations, doesn't ignore specific language letters;
2) lvcs/ltcs – collation&stemming. Ignores specific language letters + executes declinations;
3) lvco/ltco – collation. Ignores specific language letters, but doesn't execute declinations.

Proximity Search

It is possible to define the maximum number of words that may appear between certain search terms.

To use this feature, the search terms must be specified as: @ N term1 term2 @, where N is the maximum count of words between the search terms, and term1 and term2 are search terms. Any number of search terms can be included in a proximity search.

The following returns true for documents that contain the words "phone" and "fax" not further than 4 words from each other.

CONTAINS("@ 4 phone fax @")

Fuzzy search

To select documents with approximate string matching against a specific fulltext query use CONTAINS_FUZZY function in the WHERE clause:

SELECT * FROM database
WHERE CONTAINS_FUZZY("keyword",numeric_value)

Value in the brackets is called "Levenshtein distance" and it defines how much the alternative may differ from the search term. The greater this value, the greater the allowed difference.

Fuzzy search can be used in the context of entire document, thus matching against any string field with full-text search index:

SELECT * FROM product
WHERE CONTAINS_FUZZY("Mountain Bike",1)

Or the same CONTAINS_FUZZY function can be applied to a particular field:

SELECT * FROM product
WHERE name.CONTAINS_FUZZY("Mountain Bike",1)

Search in simple arrays

Data model indexes allow to query documents by values in arrays.

We will use the following document example to describe the syntax of "array search":

{ 
"name": "Urban Bicycle",
  "category": "Urban_vehicle",
  "image_url": "schwinn_s29.jpeg",
  "description": "...",
  "color": ["black","red","white"],
  "order_price": 211.16,
  "price": 259.16,
  "value":[46, 13, 24],
  "value_string":["46", "13", "24"],
  "availability": "In Stock"
}

Search by string values

Fields should be indexed as "String" in Data Model.

"Full text" search with CONTAINS in all positions of array

SELECT * FROM collection
WHERE color.CONTAINS("red")

Full text search also could be implemented for numerical values indexed as "String". For this purpose appropriate field should be overridden as "String" in Data Model; or numeric values should be added as text field (in our example - field "value_string"):

SELECT * FROM collection
WHERE value_string.CONTAINS("24")

Search in certain array position

SELECT * FROM collection
WHERE color[1] == "red"

Search by numeric values

Fields should be indexed as "Float" or "Integer" in Data Model.

Search in certain array position

SELECT * FROM collection
WHERE value[0] == 46

Search by using indexOf() function

If position of value is not known, then function indexOf() could be used:

SELECT * FROM collection
WHERE value.length > 0 && value.indexOf(24) !=-1

Search in complex arrays

If your array structure is more complicated than described above, then you can also filter documents by object attributes stored in arrays. It is possible to define SELECT criteria for objects in arrays with javascript funkctions.

1) For example, our document structure is:

{
"info": [{
"name": "John Smith",
"job": "Accountant"
}, {
"name": "Anna Smith",
"job": "Librarian, freelancer"
}]
}


1.1)
Request that SELECTS documents, where at least one of the "name" object elements in "info" array matches requirement "John Smith":

SELECT *
FROM collection
WHERE info.length > 0 && info.some( v => { return v.name == "John Smith"; })


1.2) An alternative example of an object element check:

function check_name(info)
{
for (i = 0; i < info.length; i += 1)
if (info[i].name == "John Smith") return true;
return false;
}
SELECT * FROM collection WHERE check_name(info)


2) The second document structure example is with an array into array - array "info" -> array "data":

{
"info": [{
"data": [{
"name": "John Smith",
"job": "Accountant"
}, {
"name": "Anna Smith",
"job": "Librarian, freelancer"
}]
}]
}


To SELECT documents by "name" element the following query could be defined:

SELECT *
FROM collection
WHERE info.length > 0 && info.some(v => {return v.data.some(y => {return y.name == "John Smith"; });})

Snippet and Highlight

For visual representation of results you can use functions SNIPPET and HIGHLIGHT.

Highlight - this part will be listed in the search results and the search terms within this part will be highlighted.
Snippet - a snippet (short extract) from this part will be shown in the search results. The search terms will be highlighted.

SNIPPET/HIGHLIGHT general syntax is:

field.FUNCTION(<string_text>, <json_config>)

where

  •  boths arguments are not mandatory (it is possible to define only one of them or both in random order);
  •  "field" - should be the same as in WHERE field.CONTAINS part;
  •  <string_text> - string type parameter, that defines what text part should be highlighted or snippeted. If parameter is not defined, then searched value of "field" is used;
  •  <json_config> - json type object with highlight/snippet parameters, for example:

           {
               "highlight_open_mark": "<b>";
               "highlight_close_mark": "</b>";
            }

           If this object is not defined, then is used exactly the same as in example above.

Syntax example of HIGHLIGHT

SELECT *, name.HIGHLIGHT()
FROM product
WHERE name.CONTAINS("Bike")

Syntax example of SNIPPET

SELECT *, name.SNIPPET()
FROM product
WHERE name.CONTAINS("Bike")

List Words

List Words returns the list of words corresponding to a SELECT...WHERE clause, i.e. all the words that could possibly match a WHERE condition. List Words is most useful when the search query contains wildcards. List Words returns all possible matches and the number of times this word is found in the stored documents (you can define to list words in whole collection or in certain field). List Words can be used to offer search suggestions on the fly, by searching for the currently entered characters followed by an asterix "*".


LIST_WORDS default syntax:

SELECT *
FROM LIST_WORDS(collection.field)
WHERE word == "text*"

where:
- collection - is a collection name,
- field - document field name (optional; if not defined, List Words are executed on whole collection);
- word - search query with whildcard.

An example of request (list words in whole "product" collection):

SELECT *
FROM LIST_WORDS(product)
WHERE word == "bi*"
ORDER BY count DESC

An example of response:

"results": [
{
"word": "bike",
"count": 5
},
{
"word": "bicycle",
"count": 3
},
{
"word": "bicycles",
"count": 1
}
]

Alternatives

Alternatives returns a set of words from the collection vocabulary that are relevant alternatives to the given word (you can define to get alternatives for words in whole collection or for certain field). Alternatives are words which, for example, are similar in spelling or have a different language declination. For example, if you enter "bote", then "bite" and "byte" are returned.

Words not present in any of the indexed documents are not returned by this command.
This feature can be used for fuzzy searches and spelling error corrections.
Alternatives that occur in the collection more often and are more similar to the search term are returned first. But you can use ORDER BY to define your own ordering sequence.


ALTERNATIVES default syntax:

SELECT *
FROM ALTERNATIVES(collection.field)
WHERE word == "example"

where:
- collection - is a collection name,
- field - document field name (optional; if not defined, Alternatives are executed on whole collection);
- word == "example" - mandatory parameter, should be in the same format, but additional WHERE arguments could be defined.

If you want to get the result for certain field (xpath), it could be defined in FROM clause.


An example of request (search for Alternatives in certain field - "name"):

SELECT *
FROM ALTERNATIVES(product.name)
WHERE word == "bike"

An example of response:

"results": [
    {
      "word": "bike",
      "alternative": "bicycle",
      "count": 1,
      "idif": 9.6000003814697,
      "h": 0.81278771162033,
      "cr": 0.11111111193895
    },
    {
      "word": "bike",
      "alternative": "bike",
      "count": 9,
      "idif": 0,
      "h": 1,
      "cr": 1
    }
  ]

The response returns alternative values and calcultated parameters:

cr - minimum ratio between the occurrence of the alternative and the occurrence of the search term;
idif - a number that limits how much the alternative may differ from the search term, the greater the idif value, the greater the allowed difference;
h - a number that limits the overall estimate of the quality of the alternative, the greater the cr value and the smaller the idif value, the greater the h value.

You can use parameters "cr", "idif", "h" to filter results:

SELECT word, alternative, h AS quality
FROM ALTERNATIVES(collection.field)
WHERE word == "hello" && h > 2
ORDER BY quality

Geospatial search

Clusterpoint can handle various data and has special features for working with geospatial information.

In particular, geospatial restrictions can be done using the INSIDE function in a WHERE clause.

The INSIDE function takes a shape and point coordinates as parameters and tests whether the point belongs to the shape.  With some shapes the similar INSIDE_2D and INSIDE_GEO functions may also be used.  These functions are not necessarily mutually different, in fact INSIDE typically ends up doing exacly what either INSIDE_2D or INSIDE_GEO would do.
 

INSIDE(shape, x, y) /* equivalent to shape.inside(x, y), i.e. it calls the "inside" method of the shape */
INSIDE_2D(shape, x, y) /* equivalent to shape.inside_2D(x, y) */
INSIDE_GEO(shape, x, y) /* equivalent to shape.inside_GEO(x, y) */

The built-in shapes include Circle, Rectangle and Polygon.

circle = new Circle(x, y, r);
rectangle = new Rectangle(left, top, right, bottom);
polygon = new Polygon(point_array);

Circle

A circle can be constructed in two ways:

circle = new Circle(x, y, r); /* where all the parameters x, y and r (radius) are numeric */

/* Example */
circle = new Circle(1, 2, 3)

or:

circle = new Circle(x, y, s); /* where x and y are numeric while s is a string containing a numeric constant and a unit specification, possibly separated by whitespace */

/* Example */
circle = new Circle(13, 52, "100 km")

The following units are recognized:

  • meters (may also be written as metres, meter, metre, m),
  • kilometers (also kilometres, kilometer, kilometre, km),
  • feet (also foot, ft),
  • yards (also yard, yds, yd),
  • miles (also mile, mis, mi).

A circle has all three of the above-mentioned inside methods defined:

  • inside_2D that uses Euclidean distance computation to check whether a given point is inside it;
  • inside_GEO that uses the haversine formula and assumes geospatial coordinates;
  • inside which uses one or the other of the former two methods.

To be precise, if the circle has been given without providing any units it assumes that the faster inside_2D formula is good enough. If, however, units were given during the circle construction time, it assumes that it's a circle on Earth with geospatial coordinates used for its center and uses inside_GEO for any subsequent inside calculations.

Rectangle

A rectangle is an axis-aligned rectangle and its constructor takes 4 numeric values:

rect = new Rectangle(left, top, right, bottom)

For a 2D rectangle this corresponds to providing the (x, y)-coordinates for the top-left vertex followed by the (x, y)-coordinates of the bottom-right vertex. For a rectangle in geospatial coordinates (which probably resembles an isosceles trapezoid with curving sides more than it does a rectangle) this corresponds to longitude and latitude for the top-left vertex followed by longitude and latitude for the bottom-right vertex.

All three inside methods mentioned above are provided for the rectangle. The inside method is the same as inside_GEO, which is much like inside_2D but can handle the rectangles that intersect the +180/-180 degree meridian.

Polygon

A polygon can be constructed in one of two ways:

poly = new Polygon( [ [x1, y1], [x2, y2], [x3, y3], ... [xn, yn] ])

or

poly = new Polygon( [x1, y1, x2, y2, ... xn, yn ] )

In both cases there is one top-level parameter, which must be an array. If its first element is also an array, then all the elements must be arrays of two elements, each describing the (x, y)-coordinates of a polygon vertex. If, however, the first element is numeric, then all the elements must be numeric and there must be 2n elements to describe an n-vertex polygon — first the x and y-coordinates of one vertex, then x and y-coordinates for the next, and so on. In either case no checks are made for whether the sides of the polygon cross each other or not.

At this moment the polygon only has the inside_2D method defined and inside which is the same as inside_2D.

Geo example of SELECT with WHERE

circle = new Circle(-0.12, 51.5, "5 km");
SELECT * FROM places
WHERE INSIDE(circle, lat, long);

Notes and Limitations

Querying by the field that is not present in all documents and all shards


There are special cases, when in WHERE clause you query collection by the field, that is not present in all documents and all shards. As result, you get an error in response and document that is not reliable (content of response is described below).
This may occur, if document count with "rare" field is not present in each shard (there are few documents with such field).

As example, there is collection with 10 shards and 5 replicas in each shard. Total document count is 1000, but there are 6 documents with different structure containing "rare" field named "availability". Suppose, that these 6 documents are distributed in 5 different shards (2 of them are stored in one and the same shard).
When we try to execute query:

SELECT * FROM collection WHERE availability =="Limited"

The response will return following information:
- Error message "ReferenceError: Test is not defined";
- Error message "Incidental failure - repeating the operation might succeed";
- Document containing "availability":"Limited" may be returned. But this may not be reliable, because execution of request stops on first error;
- If you use REST, it will return "500 Internal Server Error" as well.

Why this happens? Request determines, that not all shards contains documents with certain field, that is why error is returned.

We will change this behaviour. At the moment we recommend to store documents with the same structure under one collection to avoid this situation.