Source: Eiffel Software Blog

Eiffel Software Blog ZSTRING released in Eiffel-Loop 1.3.1

Contents1 Preface2 What is ZSTRING?3 How is ZSTRING different from STRING_8?4 How ZSTRING Works4.1 The unencoded array4.2 The Default Codec5 The ZSTRING Test Suite6 ZSTRING for Asian text6.1 A 2-byte Solution6.2 A 4-byte Solution7 Benchmark Notes7.1 Benchmark Source Code links on Github7.1.1 Pure Latin-15 encoding7.1.2 Mixed Latin-15 and Unicode encoding7.2 The Test Data7.3 Memory Consumption7.4 Runtime Performance7.5 I Ching Hexagram Test Strings8 ZSTRING argument options9 Extra ZSTRING routines10 Source Code LinksPreface Since 2012 I have been interested in the idea of creating a memory efficient alternative to STRING_32 that does not sacrifice performance. In 2013 I published this article which describes and benchmarks the class EL_ASTRING (aka ASTRING). While ASTRING peformed well against STRING_32 and the Gobo UC_UTF8_STRING in terms of both memory consumption and performance, it had one fundamental problem which I felt would prevent it from ever gaining wide acceptance. This was that each ASTRING instance was limited to a maximum of 26 unicode characters that could be used to augment the base latin character set.In the fall of 2015 I came up with a better algorithm which did not suffer the limitations of ASTRING. Although the new design was somewhat less memory efficient for storing characters wider than 8-bits, it performed well overall. I decided to call it ZSTRING. In November 2015 I announced work in progress on ZSTRING in this article.I am happy to announce that this project is now completed and since January I have migrated all my projects to use ZSTRING. ZSTRING is now available to download as part of the latest 1.3.1 release of Eiffel-Loop.Finnian ReillyWhat is ZSTRING? ZSTRING is a memory efficient alternative to the standard STRING_32 class. If your application processes text that is mainly written in a language that is encodeable by a latin character set, you can expect up to a 75% saving in memory. There is a small trade off in performance, but not by much as the benchmarks show. Actually there is not always a trade off. Some ZSTRING string operations are slightly faster than their STRING_32 equivalent.How is ZSTRING different from STRING_8? With STRING_8 you are limited to characters from the ISO-8859-1 character set. A ZSTRING instance can be initialized from any STRING_32 string and you are guaranteed to always get back the same STRING_32 string with a call to the function {ZSTRING}.as_string_32. You can also convert it to UTF-8 with a call to {ZSTRING}.to_utf_8.How ZSTRING Works ZSTRING uses a combination of two different character arrays to represent the text. The first array is defined as:area: SPECIAL [CHARACTER_8]When initialized from a STRING_32, area holds every character that is encodeable using a system defined latin character-set (ISO-8859-15 by default). Any unicode character that is not encodeable as latin, is denoted by the control code 26 'SUB' character. SUB stands for substitution.The unencoded array The second array is defined as:unencoded_area: SPECIAL [NATURAL_32]This array is used to store any characters that are not encodeable by the latin codec. The array is structured as a sequence of substrings. The first two cells of each substring record the indices defining the substring interval. As an example to illustrate, consider the string:"#61 Zhōng Fú 中孚 Inner Truth"It contains a total of 27 characters. 24 of these are encodeable as Latin-15 and 3 are not. The three that are not encodeable are stored in the unencoded_area array as shown in the following diagram:The olive colored cells store the interval indices of the 2 substrings "ō" and "中孚".The Default Codec By default, ZSTRING uses the codec class EL_ISO_8859_15_ZCODEC to convert unicode strings. However you can override this choice by calling the routine:{EL_SHARED_ZCODEC}.set_system_codecFor a desktop application you could set the system codec according to the user locale. The following class offers a convenient way to create codec instances from an integer arguement:class EL_SHARED_ZCODEC_FACTORYinherit EL_FACTORY_CLIENTfeature {NONE} -- Factory new_iso_8859_codec (id: INTEGER): EL_ISO_8859_ZCODEC require not_iso_8859_12: id /= 12 valid_id: (1 |..| 15).has (id) do Result := ISO_8859_factory.instance_from_type (ISO_8859_codec [id], agent {EL_ISO_8859_ZCODEC}.make) endThe ZSTRING Test Suite ZSTRING has a comprehensive test suite that verifies the results of all string operations against STRING_32.See ZSTRING_TEST_SET also TEST_STRING_CONSTANTSZSTRING for Asian text If your application is mainly processing Asian text, it is obvious that the current implementation of ZSTRING will incur a significant penalty in terms of both memory and runtime performance. The solution to this problem is to use a different source compatible implementation of ZSTRING adapted for Asian text that can be set with an ECF option.A 2-byte Solution The solution is in inherent in the fact that most Asian characters will fit into 2 bytes. By changing the storage arrays to the following definition it should be possible to create an efficient ZSTRING for Asian textarea: SPECIAL [NATURAL_16]unencoded_area: SPECIAL [NATURAL_32]Now the criteria for whether the character is stored in the unencoded_area is if it does not fit into 2-bytes. As it is unlikely that you will have many consecutive runs of 3 or 4-byte characters, the current scheme of using 2 cells for interval onsets and offsets makes no sense. Instead we should assume that over-sized characters occur in isolated instances and we only need to store the index. However some research is required to verify these assumptions.Ideally area should be defined as CHARACTER_16, but as this is not part of standard Eiffel, NATURAL_16 can be used as a workaround.Perhaps in the future some Asian coder will be interested to take on the challenge of implementing this.A 4-byte Solution Of course an easier option is to create a source compatible version of ZSTRING that uses the same implementation as STRING_32.Benchmark Notes You can find a set of benchmarks at the following URL comparing ZSTRING and STRING_32 in terms of memory consumption and runtime performance.http://www.eiffel-loop.com/benchmarks/ZSTRING-benchmarks-latin-15.htmlBenchmark Source Code links on Github ZSTRING_BENCHMARK_APPPure Latin-15 encoding STRING_BENCHMARKZSTRING_BENCHMARKSTRING_32_BENCHMARKMixed Latin-15 and Unicode encoding MIXED_ENCODING_STRING_BENCHMARKMIXED_ENCODING_ZSTRING_BENCHMARKMIXED_ENCODING_STRING_32_BENCHMARKThe Test Data The test data used in these benchmarks is shown at the bottom of the benchmark page. The data is composed of 64 rows and 4 columns referenced as A, B, C, D. The Input column indicates which combination of columns have been selected for each test. For example "$A $D" means that each input string is a string from column A joined with a string from column D. Each runtime performance test is repeated for each of the 64 rows.The columns have different storage characteristics as follows:Columns A and D consist solely of characters that can be encoded as Latin-15. Column B is composed of characters, some of which will need to be stored in the unencoded_area array. Column C is composed of characters, all of which will need to be stored in the unencoded_area array.Memory Consumption Table 1 shows memory consumption under optimal conditions, where all the text data can be encoded as Latin-1.Table 2 shows memory consumption under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array.Runtime Performance Runtime performance is benchmarked using the 25 most common string operations. The performance figures are the average of 1000 runs for each benchmark test operation. The results are sorted in order of descending ZSTRING performance rank. Table 3 shows runtime performance under optimal conditions, where all the text data can be encoded as Latin-1. Table 4 shows performance under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array. Table 4 has more input permutations than table 3.I Ching Hexagram Test Strings Table 5 at the bottom of the page shows the data used for the benchmarks. The data are names and descriptions of the 64 I Ching hexagrams divided into 4 columns as follows: A. Hexagram number B. Chinese names as pinyin phonetic spelling C. Chinese names as chinese characters D. English descriptionZSTRING argument options Most ZSTRING routines have two alternate forms as follows:taking arguments of type EL_READABLE_ZSTRINGtaking arguments of type READABLE_STRING_GENERALFor example:append_stringappend_string_generalExtra ZSTRING routines ZSTRING has a number of extra routines not found in the standard Eiffel string classes.substituted_tuple alias "#$" (inserts: TUPLE): like CurrentThis routine is inspired by the Python string formatting or interpolation operator '%'. It is not quite as sophisticated as the Python operator as it cannot do numeric formatting, but nevertheless is very useful. It means any ZSTRING can be used as an interpolation template with the addition of '%S' markers (AKA '#').translate (old_characters, new_characters: EL_READABLE_ZSTRING)This routine is inspired by the Python method string.translate(s, table[, deletechars]). ZSTRING has a number of different variations of this routine one of which is able to do character deletion.substring_index_list (str: EL_READABLE_ZSTRING): like internal_substring_index_listThis returns a list of substring indices as an arrayed list.substring_intervals (other: EL_READABLE_ZSTRING): EL_SEQUENTIAL_INTERVALSThe class EL_SEQUENTIAL_INTERVALS is a high performance alternative to using ARRAYED_LIST [INTEGER_INTERVAL] both in terms of memory consumption and runtime performance.escaped (escaper: EL_CHARACTER_ESCAPER [EL_ZSTRING]): like CurrentReturns an escaped copy of current stringunescape (escape_table: EL_ESCAPE_TABLE)Unescapes

Read full article »

Eiffel Software's Competitors | Eiffel Software's News | Eiffel Software's Financials

Followers on Owler

Est. Annual Revenue

$100K-5.0M

Est. Employees

25-100

CEO

Update CEO

CEO Approval Rating

- -/100