How To Strip Hebrew Accent Marks
The trick here is to realize that these characters that you see in the question with the "accents" aren't really the characters (i.e. "These aren't the droidscharacters you are looking for" ;-) ). The "accents" are various types of notations indicating things like:
vowels (lines and dots that are typically under the letters):
base letter "ה" = "h"; "הֶ" = "heh" and "הָ" = "hah"
pronunciation (dots that are usually inside or above letters):
"בּ" = "b" vs "ב" = "v", or "שׂ" = "s" vs "שׁ" = "sh"
punctuation
- cantillation (how it should be sung)
The actual Hebrew letters are what is shown in the stripped down version (i.e. the end result of what is being requested here). What we are referring to here as "accents" are known as diacritical marks. The Wikipedia article on Hebrew diacritics has a lot of good information about these marks, including the following image and caption:
Gen. 1:9 And God said, "Let the waters be collected".
Letters in black, pointing in red, cantillation in blue
Getting from those base characters to what the first line (with the vowels, etc) shows is a matter of adding one or more "accents". Unicode (UTF-16 in SQL Server, though default interpretation only handles the UCS-2 / Basic Multilingual Plane (BMP) code points) allows for some characters to overlay another non-overlay character when adjacent to them. These are known as Combining Characters.
Meaning:
SELECT DATALENGTH(N'מַ֖'); -- character taken from original given text
Returns:
6
not 2
as most people would expect from seeing a single, double-byte character. So maybe we try to find what character is there by doing:
SELECT UNICODE(N'מַ֖');
which returns:
1502
Of course, the UNICODE
and ASCII
functions only return the INT
value of the first character of whatever string they are given. But a value of 1502 only covers 2 bytes, which leaves 4 bytes unaccounted for. Looking at the binary/hex values of that same Hebrew "character":
SELECT NCHAR(1502), CONVERT(BINARY(2), UNICODE(N'מַ֖')), CONVERT(VARBINARY(10), N'מַ֖');
we get:
מ
0x05DE 0xDE05B7059605
Now, 0x05DE is the hex representation of 1502, and the 1502 is only the "מ". The next part can be separated into three 2-byte sets: DE05 B705 9605. Now, Unicode string values are stored in Little Endian, which means the byte-order is reversed. If we switch each of those three sets we get:
05DE (the base character) 05B7 0596 (the unaccounted for 4 bytes).
Ok. So what happens if we remove that base character?
SELECT REPLACE(N'מַ֖' COLLATE Hebrew_BIN2, NCHAR(1502) COLLATE Hebrew_BIN2, '');
That returns the two remaining characters (not easy to see here so I have made the following line a header in order to increase the font size; you can also run the above REPLACE
to see them):
Removing the מ from the מַ֖ leaves two characters at the bottom: ַ֖
Hence, we need to strip out each individual code-point that is one of these "extra" combining characters (found at: http://unicode-table.com/en/search/?q=hebrew) and that will leave us with the base characters. We can do that via:
CREATE FUNCTION dbo.RemoveHebrewAccents (@txeTwerbeH NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN
WITH base (dummy) AS
(
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), nums AS
(
-- we will want to generate code points 1425 - 1479
SELECT TOP (55) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [Num]
FROM base b1
CROSS JOIN base b2
)
SELECT @txeTwerbeH = REPLACE(
@txeTwerbeH COLLATE Hebrew_BIN2,
NCHAR(1424 + nums.[Num]) COLLATE Hebrew_BIN2,
''
)
FROM nums;
RETURN @txeTwerbeH;
END;
And then we can test it with the original text as follows:
DECLARE @Hebrew NVARCHAR(200) = N'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ';
SELECT dbo.RemoveHebrewAccents(@Hebrew);
Returns:
בראשית ברא אלהים את השמים ואת הארץ
Additional notes:
Technically, there is a set of code-points between 64298 and 64334 that do have some vowels and pronunciation "accents" built into the character. If those need to be handled, that can be a second step in the function to do a simple replacement of those characters.
It seems that these accent, punctuation, etc code-points only match when using a binary collation. Even using
Hebrew_100_CS_AS_KS_WS_SC
did not match them. But the following did work:Hebrew_BIN
,Hebrew_BIN2
,Latin1_General_BIN
, andLatin1_General_BIN2
. In the function I ended up usingHebrew_BIN2
. Please note that when using binary collations, unless you have a specific need to use the older_BIN
collations, you should only be using the newer_BIN2
collations.For anyone who is curious, the Hebrew sample text is actually Bereishis 1:1 (that is also the first word on the right-side as Hebrew is read right-to-left; in English it would be "Genesis 1:1" though that is not a direct translation of the word, just the name of the first book of the Torah / Bible; the direct translation is "in the beginning"):
In the beginning of God's creating the heavens and the Earth
2015-01-19: I found some great resources which explain both Combining Characters and the Hebrew character set:
- http://en.wikipedia.org/wiki/Combining_character (also linked towards the top of this answer)
- http://www.alanwood.net/unicode/combining_diacritical_marks.html
- http://www.alanwood.net/unicode/hebrew.html
- http://people.w3.org/rishida/scripts/pickers/hebrew/ (play with all of the characters in the Hebrew character set; very cool!)