|
Search
Features — International Languages
|
|
| |
| Unicode Support |
 |
Unicode
support allows for indexing and searching of
non-English text, including every character set
supported by the Unicode standard. |
 |
In
addition to Unicode support, dtSearch offers
extensive alphabet customization options. |
 |
See the Unicode
FAQ for more technical information. |
 |
For a general Unicode overview, see the Unicode and Text Retrieval white paper. |
|
|
| Language Extension Packs |
 |
The
dtSearch product line includes an English noise
word list and stemming rules (to find words such
as learn, learned, learns, learning, etc.
that are linguistically related). |
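
As a rough illustration of how a noise word list and stemming rules work together, here is a minimal Python sketch. The word list, the suffix rules, and the names `NOISE_WORDS`, `SUFFIXES`, `stem`, and `index_terms` are all assumptions made for this example; they are not dtSearch's actual list, rules, or API.

```python
# Hypothetical noise word list: common words that are skipped when indexing.
NOISE_WORDS = {"a", "an", "and", "the", "of", "to", "is"}

# Hypothetical suffix rules, tried in order. Real stemming rules are richer.
SUFFIXES = ("ing", "ed", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Lowercase the text, drop noise words, and stem what remains."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in NOISE_WORDS]
```

With these toy rules, learn, learned, learns and learning all reduce to the same term, so a search for any one of them can match the others.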
 |
dtSearch's
UK distributor offers pre-packaged sets of noise
word lists and stemming rules covering a wide
variety of European languages; see Language
Extension Packs. |
 |
The
Western European group includes (in addition
to English): Danish, Dutch, Finnish, French,
German, Italian, Norwegian, Portuguese, Spanish
and Swedish. |
 |
The
Eastern European group includes: Belarusian, Bosnian, Bulgarian, Croatian, Czech, Estonian, Greek, Hungarian, Latvian, Lithuanian, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Turkish and Ukrainian. See also the Cyrillic
article. |
 |
Licensing:
dtSearch Corp. can add either the Western
European group or the Eastern European group
onto a signed dtSearch developer license.
Please contact
dtSearch for details. Both packages
may also be licensed directly from www.dtsearch.co.uk. |
 |
More
information on the Language
Extension Packs |
 |
Request a trial version |
 |
Visit
distributor's site in English, Français, Deutsch |
|
| Chinese, Japanese and Korean Text With No Word Breaks |
 |
Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. |
 |
Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. |
 |
To make this type of text searchable, enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters, so that each character is treated as a single word. |
 |
dtSearch Desktop/Network: In Options > Preferences > Letters and Words, check the box to “Insert word breaks between Chinese, Japanese, and Korean characters in text.” |
 |
dtSearch Developer API: set dtsoTfAutoBreakCJK in Options.TextFlags. |
| Language Analyzer API Integration |
 |
In addition to the extensive alphabet customization options available across the dtSearch product line, the dtSearch Engine also includes a Language
Analyzer API that can be used to integrate morphological analyzers and custom or dictionary-based word breakers into the dtSearch Engine indexing process. |
 |
The dtSearch Engine also includes an API for substituting a non-English language thesaurus for the existing English-language one. |
| Basis Technology's Rosette® Linguistics Platform Integration |
 |
The Rosette Linguistics Platform helps unlock the meaning of unstructured text by determining its language and identifying its basic linguistic features and structure. Relying on code that is specific to each language, Rosette produces highly accurate morphological analysis of Chinese, Japanese, Korean, and other international languages. |
 |
The Rosette Linguistics Platform integrates with dtSearch search functionality through the dtSearch Engine’s Language Analyzer API. Essentially, the dtSearch Engine API passes blocks of Unicode text to the Rosette Linguistics Platform and accepts back words to index. |
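
The general shape of this kind of integration, in which the indexer hands blocks of Unicode text to an external analyzer and indexes whatever words come back, might look like the Python sketch below. Everything here is hypothetical: `Analyzer`, `make_dictionary_analyzer`, and `build_index` are names invented for this example, the greedy longest-match word breaker merely stands in for a real morphological analyzer, and none of it reflects the actual dtSearch or Rosette APIs.

```python
from typing import Callable, Iterable

# Hypothetical analyzer contract: take a block of text, return words to index.
Analyzer = Callable[[str], Iterable[str]]


def make_dictionary_analyzer(dictionary: set[str]) -> Analyzer:
    """Build a greedy longest-match word breaker from a word list."""
    longest = max(map(len, dictionary), default=1)

    def analyze(block: str) -> list[str]:
        words = []
        i = 0
        while i < len(block):
            if block[i].isspace():
                i += 1
                continue
            # Try the longest candidate first; fall back to one character.
            for size in range(min(longest, len(block) - i), 0, -1):
                candidate = block[i : i + size]
                if size == 1 or candidate in dictionary:
                    words.append(candidate)
                    i += size
                    break
        return words

    return analyze


def build_index(blocks: Iterable[str], analyzer: Analyzer) -> dict:
    """Map each word returned by the analyzer to (block, position) pairs."""
    index: dict[str, list[tuple[int, int]]] = {}
    for block_id, block in enumerate(blocks):
        for position, word in enumerate(analyzer(block)):
            index.setdefault(word, []).append((block_id, position))
    return index
```

Because the indexer depends only on the analyzer contract, a different word breaker can be swapped in without changing `build_index`.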
 |
For more details on how the two products work together, including a chart detailing the steps involved in the dtSearch Engine and Rosette API integration, please see the dtSearch and Rosette Full-Featured International Search white paper (PDF). |
|
| |
|
| |
|
 |
The
dtSearch product line can instantly search
terabytes of text across a desktop, network,
Internet or Intranet site. |
dtSearch
products also serve as tools for publishing
large document collections, with instant text
searching, to Web sites or CD/DVDs. |
 |
over
two dozen indexed, unindexed, fielded and full-text
search options |
 |
highlights
hits in HTML, XML and PDF, while displaying
embedded links, formatting and images |
 |
converts
other file types — word processor, database,
spreadsheet, email and full-text of email attachments,
ZIP, Unicode, etc. — to HTML for display
with highlighted hits |
 |
built-in Spider adds
a third-party or other Web site (public, secure
content, password accessible, etc.) to your searchable
database |
 |
Spider supports
Web-based content (HTML, PDF, XML, etc.) as well
as dynamically-generated content (ASP.NET, MS CMS,
SharePoint, etc.) |
| General
supported file types |
| SQL
and similar data sources |
|
|
|