Search Engine

About the built-in search engine

Publications or ebooks created with HTML Executable come with a built-in search engine that allows end users to search for specific words or expressions through the entire publication in seconds. When compiling your publication or ebook, HTML Executable parses all HTML pages and PDF documents, collecting keywords from them. These keywords are then indexed and the result is stored in the publication's data. Since keywords are indexed, it only takes seconds for a search query to be completed.
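As a rough illustration of why indexed keywords make lookups fast, the following JavaScript sketch builds a tiny inverted index. The function name `buildIndex` and the sample pages are hypothetical; this is not HTML Executable's actual indexing code.

```javascript
// Hypothetical sketch: map each keyword to the set of pages that contain it.
function buildIndex(pages) {
  const index = new Map();
  for (const page of pages) {
    // Lowercase the text and split on anything that is not a letter or digit.
    const words = page.text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
    for (const word of words) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(page.id);
    }
  }
  return index;
}

const index = buildIndex([
  { id: "intro.html", text: "Welcome to the ebook" },
  { id: "setup.html", text: "Install the ebook viewer" },
]);

// A lookup is a single map access rather than a scan of every page.
const hits = [...(index.get("ebook") ?? [])];
```

Because the work of reading every page is done once at compile time, a query only has to consult the stored map.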

HTML Executable reports in the compilation log the number of pages and unique words found during indexing.

End users can access the search panel by clicking the "Search" button or by selecting "Navigate|Show Search" in the application.

Configuring the search engine

Enabling the search engine can increase the size of the output EXE file, depending on the number of HTML pages and PDF documents you compile. If you do not want to include a search engine in your publication, turn on the following option: "Disable the search engine".

PDF documents can be indexed too, even if the built-in PDF viewer is deactivated.

When a search is complete, the application lists the results. Each result displays the page's title and an extract with the found keyword(s) or expression(s).

Search results are automatically sorted by relevance. To do this, HTML Executable counts the number of occurrences of the search terms in each page, then assigns a percentage of relevance.
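The ranking idea can be sketched as follows. This is only an illustration: the names `countOccurrences` and `rankPages` are hypothetical, and expressing relevance relative to the best-matching page is an assumption, since the exact formula is not documented here.

```javascript
// Hypothetical sketch: rank pages by how often the search terms occur in them.
function countOccurrences(text, term) {
  const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u);
  return words.filter((w) => w === term.toLowerCase()).length;
}

function rankPages(pages, terms) {
  const scored = pages
    .map((page) => ({
      id: page.id,
      hits: terms.reduce((sum, t) => sum + countOccurrences(page.text, t), 0),
    }))
    .filter((entry) => entry.hits > 0);
  const best = Math.max(0, ...scored.map((entry) => entry.hits));
  // Assumption: relevance is expressed relative to the best-matching page.
  return scored
    .map((entry) => ({ ...entry, relevance: Math.round((entry.hits / best) * 100) }))
    .sort((a, b) => b.relevance - a.relevance);
}

const ranked = rankPages(
  [
    { id: "a.html", text: "search search index" },
    { id: "b.html", text: "search tips" },
    { id: "c.html", text: "unrelated page" },
  ],
  ["search"]
);
```

In this sample, the page with two occurrences is listed first and the page with no occurrence is left out of the results.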

Some keywords may be automatically excluded from the index, so they won't return any results if end users search for them. In addition to some common words, you may add your own sensitive keywords to the exclusion list: press Add and specify the keyword. Conversely, you can remove keywords from the exclusion list by selecting them and clicking Remove. Keyword exclusion lists can be imported from and exported to XML files using the XML Tools button, so you can edit them manually in any XML editor.

Language Support

The search engine supports multiple languages, enhancing the search experience for various kinds of content. Specify the language of your HTML and PDF files so that the search engine can be optimized for that language. In particular, certain very common words in the language, known as "stop words", are excluded from the search by default. These are typically short function words such as "and", "the", and "of" in English; they occur so frequently that including them could skew the relevance of the search results. The stop words are defined in a JavaScript file that supplements lunr.js, and the file is selected automatically based on the language you choose. This approach ensures more accurate, language-specific search in the compiled ebooks or applications.
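The effect of stop words can be shown with a minimal sketch. The list `STOP_WORDS_EN` and the function `keywordsFor` are hypothetical and far smaller than a real language file; they are not taken from the files shipped with HTML Executable.

```javascript
// Hypothetical sketch: a tiny English stop-word list (real lists are much larger).
const STOP_WORDS_EN = new Set(["and", "the", "of", "a", "to", "in"]);

// Returns the words of `text` that would remain after stop-word filtering.
function keywordsFor(text, stopWords) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((word) => word.length > 0 && !stopWords.has(word));
}

const keywords = keywordsFor("The history of the printing press", STOP_WORDS_EN);
```

Only "history", "printing", and "press" remain, so a search for "the" would return nothing for this sentence.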

For advanced users who wish to view their contents, the JavaScript language files are available in the following folder:

C:\Program Files (x86)\HTML Executable 2023\Resources\JavaScript

Indexing content available only inside the following HTML tag

HTML Executable can restrict indexing to the content of a specific tag in each HTML page. By default, it indexes the content within the 'body' tag. However, if you're using a template with various frames where the same words appear on all pages of your website, this can skew the search results. To address this, specify the tag that contains the content unique to each page. For instance, suppose your website places its content within a 'div' HTML tag with the ID 'content'. In this case, you would enter `<div id="content"` into HTML Executable. HTML Executable will then only index the content enclosed between the `<div id="content">` tag and its corresponding closing `</div>` tag. This makes searches more accurate and focused, because only the relevant content of each page is indexed.
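To make the idea concrete, here is a minimal sketch of extracting the content between an opening tag and its matching closing tag. The function `extractTagContent` is hypothetical, and the simple depth counter it uses is an assumption for illustration, not a description of how HTML Executable parses pages.

```javascript
// Hypothetical sketch: return the HTML between an opening tag that starts with
// `openingPrefix` (for example '<div id="content"') and its matching closing tag.
function extractTagContent(html, openingPrefix) {
  const start = html.indexOf(openingPrefix);
  if (start === -1) return null;
  const tagName = /^<([a-zA-Z][a-zA-Z0-9]*)/.exec(openingPrefix)?.[1];
  if (!tagName) return null;
  const contentStart = html.indexOf(">", start) + 1;
  if (contentStart === 0) return null;
  // Count nested tags of the same name to find the matching close.
  const tagPattern = new RegExp(`<(/?)${tagName}\\b[^>]*>`, "gi");
  tagPattern.lastIndex = contentStart;
  let depth = 1;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    depth += match[1] === "/" ? -1 : 1;
    if (depth === 0) return html.slice(contentStart, match.index);
  }
  return null; // unbalanced markup
}

const page =
  '<body><div id="menu">Home Contact</div>' +
  '<div id="content"><p>Chapter one</p><div>nested</div></div></body>';
const indexed = extractTagContent(page, '<div id="content"');
```

In this sample the menu text is skipped, while the nested 'div' inside the content area is kept.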

Support for Unicode

The search engine is Unicode-enabled. When parsing HTML pages, HTML Executable takes account of the encoding format and the charset defined in HTML documents. All keywords are natively converted and stored in UTF-8 format.

If no charset is defined in an HTML file, you can specify the default HTML charset that should be used (by default, UTF-8).
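The fallback logic might look like the sketch below. The function `decodeHtml` is hypothetical, and looking for a `charset=` declaration in a `meta` tag is an assumption about how a charset would be detected; the standard `TextDecoder` API performs the actual conversion.

```javascript
// Hypothetical sketch: decode page bytes with the declared charset, or a default.
function decodeHtml(bytes, defaultCharset = "utf-8") {
  // Charset declarations are ASCII, so a lenient UTF-8 pass over the first
  // bytes is enough to locate one.
  const head = new TextDecoder("utf-8").decode(bytes.subarray(0, 1024));
  const declared = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head)?.[1];
  // TextDecoder throws a RangeError if the charset label is not supported.
  return new TextDecoder(declared ?? defaultCharset).decode(bytes);
}

// No charset is declared here, so the default (UTF-8) is used.
const bytes = new TextEncoder().encode("<html><body>caf\u00e9</body></html>");
const text = decodeHtml(bytes);
```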

About searches

HTML Executable uses lunr.js as its search engine. According to its documentation:

At the most basic level, search queries can consist of a single term, like 'hello'. However, they can also include multiple terms, which are joined with an OR operator. For example, the query 'hello world' will retrieve documents containing either 'hello' or 'world', but documents containing both will rank higher.
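The OR behavior described above can be imitated with a short sketch. The function `searchAny` is hypothetical and counts matched terms only; lunr.js itself uses a more elaborate scoring scheme.

```javascript
// Hypothetical sketch: OR semantics, with pages matching more terms ranked higher.
function searchAny(pages, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return pages
    .map((page) => {
      const words = new Set(page.text.toLowerCase().split(/[^\p{L}\p{N}]+/u));
      return { id: page.id, matched: terms.filter((t) => words.has(t)).length };
    })
    .filter((result) => result.matched > 0)
    .sort((a, b) => b.matched - a.matched);
}

const results = searchAny(
  [
    { id: "a.html", text: "hello there" },
    { id: "b.html", text: "hello world" },
    { id: "c.html", text: "goodbye" },
  ],
  "hello world"
);
```

Both pages containing 'hello' are returned, but the page that also contains 'world' comes first.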

You can add wildcards (written as an asterisk, '*', in lunr.js) to terms to represent one or more unspecified characters. These wildcards can be positioned anywhere within the term, and a term can contain more than one wildcard. While this broadens the range of documents found, it can negatively affect query performance, especially when a wildcard is placed at the beginning of a term.

By default, when the end user types a query, HTML Executable immediately starts searching for it (a wildcard is always added to the end of the query entered).
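Wildcard matching and the automatic trailing wildcard can be illustrated as follows. The functions `wildcardToRegExp` and `asTypedQuery` are hypothetical, and translating the pattern to a regular expression is a swapped-in technique for illustration; lunr.js matches wildcards against its index differently. In this sketch a wildcard may also match zero characters.

```javascript
// Hypothetical sketch: turn a term containing '*' wildcards into a regular expression.
function wildcardToRegExp(term) {
  const escaped = term
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`, "i");
}

// Imitates the described behavior of appending a wildcard to what the user typed.
function asTypedQuery(input) {
  return input.endsWith("*") ? input : `${input}*`;
}

const matchesWhileTyping = wildcardToRegExp(asTypedQuery("publ")).test("publication");
```

With the trailing wildcard, a partially typed word such as 'publ' already matches 'publication'.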

You can limit terms to specific fields. For instance, with 'title:hello', only documents with 'hello' in the title field will match. Using a field that isn't in the index will result in an error.

HTML Executable's search facility supports modifiers for terms, including edit distance and boost. Boosting a term (e.g., 'foo^5') increases the ranking of documents matching that term. Edit distance enables fuzzy matching: for example, 'hello~2' will match documents containing words within an edit distance of 2 from 'hello'. To improve query performance, it's best to avoid large values for edit distance.
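For readers unfamiliar with edit distance, the sketch below computes the plain Levenshtein distance (insertions, deletions, and substitutions) and applies it to a term such as 'hello~2'. The function names are hypothetical, and lunr.js's own fuzzy matching may differ in detail.

```javascript
// Hypothetical sketch: Levenshtein distance between two words.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      // Deletion, insertion, or substitution, whichever is cheapest.
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Parses a term such as 'hello~2' into { term: 'hello', maxDistance: 2 }.
function parseFuzzy(input) {
  const match = /^(.+)~(\d+)$/.exec(input);
  return match
    ? { term: match[1], maxDistance: Number(match[2]) }
    : { term: input, maxDistance: 0 };
}

function fuzzyMatches(input, word) {
  const { term, maxDistance } = parseFuzzy(input);
  return editDistance(term, word) <= maxDistance;
}

const closeEnough = fuzzyMatches("hello~2", "help");
```

Here 'help' is two edits away from 'hello' (one substitution and one deletion), so it matches 'hello~2' but would not match 'hello~1'.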

Terms can have a presence modifier. By default, the presence of a term in a document is optional, but you can make it required or prohibited. Prefix the term with '+' to require its presence (e.g., '+foo bar' searches for documents that must contain 'foo' and may contain 'bar'). Prefix with '-' to prohibit its presence (e.g., '-foo bar' searches for documents that cannot contain 'foo' but may contain 'bar').
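The presence rules can be sketched like this. The functions `parsePresence` and `pageMatches` are hypothetical and only model the behavior described above; they ignore wildcards, fields, and other modifiers.

```javascript
// Hypothetical sketch: split a query into required, prohibited, and optional terms.
function parsePresence(query) {
  const result = { required: [], prohibited: [], optional: [] };
  for (const token of query.split(/\s+/).filter(Boolean)) {
    if (token.startsWith("+")) result.required.push(token.slice(1));
    else if (token.startsWith("-")) result.prohibited.push(token.slice(1));
    else result.optional.push(token);
  }
  return result;
}

// `words` is the list of keywords found on one page.
function pageMatches(words, query) {
  const { required, prohibited, optional } = parsePresence(query);
  const has = (term) => words.includes(term);
  if (!required.every(has) || prohibited.some(has)) return false;
  // Without required terms, at least one optional term must be present,
  // unless the query consists of prohibited terms only.
  return required.length > 0 || optional.length === 0 || optional.some(has);
}

const parsed = parsePresence("+foo -baz bar");
```

For example, a page containing only 'bar' is rejected by '+foo bar' but accepted by '-foo bar'.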

To escape special characters, use the backslash ('\'). This allows you to include characters in searches that would otherwise be interpreted as modifiers. For instance, 'foo\~2' will search for the term "foo~2" instead of trying to apply an edit distance of 2 to the search term "foo".
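A minimal sketch of how escaping changes the interpretation of a term is shown below. The function `parseTerm` is hypothetical and handles only the '~' modifier; it is not lunr.js's query parser.

```javascript
// Hypothetical sketch: read a term, treating a backslash as
// "take the next character literally".
function parseTerm(input) {
  let term = "";
  let editDistance = null;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === "\\" && i + 1 < input.length) {
      term += input[++i]; // the escaped character becomes part of the term
    } else if (ch === "~" && /^\d+$/.test(input.slice(i + 1))) {
      editDistance = Number(input.slice(i + 1)); // unescaped '~N' is a modifier
      break;
    } else {
      term += ch;
    }
  }
  return { term, editDistance };
}

const plain = parseTerm("foo~2"); // term "foo" with an edit distance of 2
const escapedTerm = parseTerm("foo\\~2"); // the user typed foo\~2: literal term "foo~2"
```

Note that the doubled backslash in the last line is only JavaScript string syntax; the end user types a single backslash.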

When a page from a search result is opened, keywords that were searched for may be highlighted. For PDF documents, keywords are highlighted too.

Customizing the display of search results

Advanced programmers can modify the HTML code and JavaScript scripts that power HTML Executable's search engine. All these HTML and JavaScript resources can be found in the zip file named chromium.zip, which is available in the HTML Executable installation directory, usually located at:

C:\Program Files (x86)\HTML Executable 2023\Resources\Chromium\

The search engine's functionality can also be customized by modifying the JavaScript code found in assets\js\search.js within the chromium.zip file.

However, it's important to note that only experienced users should attempt these modifications. It's essential to make backups before making any changes to prevent loss of original files or data.

Large search index

If you get an "out of memory" error while compiling your publication, try enabling the "Keep the search index data outside the EXE file" option, available in Output Format. The error means that you have reached the memory limit for 32-bit programs (2 GB). In that case, HTML Executable cannot hold your search index in memory and must store it as a file on the hard disk.