How to store the EXACT UTF-8 data in SQL Server 2008?
-
jueves, 15 de marzo de 2012 10:56Hello sir/mam,
We have web applications in sharepoint with Phonetic features dictionary using special character. In our web applications which is done in sharepoint, we can input, retrieve and display properly the PHONETIC special character.
But now, we have to input or enter 42000 plus phonetic stored in excel file. We don't want to do it manually using our web applications in sharepoint. So we used PHP and SQLSRV 2.0 for this task to migrate the 42000 plus phonetic directly to our database in SQL Server 2008.
PHP Code:
header( 'Content-Type: text/html; charset=UTF-8' );
mb_internal_encoding( 'UTF-8' );
$file_handle = fopen ($file, "r");
$text = fgets($file_handle, 4096);
$text = explode("\t", $text);
$Fonetik = $text[0];
fclose($file_handle);
$connectionInfo = array("UID"=>$username, "PWD"=>$password, "Database"=>$db, "CharacterSet" => "UTF-8");
$sqlconnect = sqlsrv_connect($server, $connectionInfo);
$sqlquery = "INSERT INTO $table (Fonetik) VALUES (?)";
$sqlparam = array($Fonetik, SQLSRV_PARAM_IN, SQLSRV_PHPTYPE_STRING('UTF-8');
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery, $param1);
Excel File: [mᴧŋasaˀ]
Text file (UTF-8 encoding): [mᴧŋasaˀ]
PHP (without charset): [mᴧŋasaˀ]
PHP (with charset utf8): [mᴧŋasaˀ]
We only used PHP to store or migrate the data to SQL Server 2008, instead of manually doing it one by one the 42000 plus phonetic in excel using our web applications in sharepoint.
the PHP successfully stored the UTF-8 data to SQL Server 2008. But when we used our web applications in sharepoint and even in SQL Server Studio Management, the data is different. The value is [mᴧŋasaˀ], like PHP display without the charset set to UTF-8.
What we want is to store the phonetic data exactly it display in Excel File, Text File and even in PHP with charset set to UTF-8. How do we store this data to SQL Server to look like this in SQL Server 2008 exactly like this [mᴧŋasaˀ]?
If we use PHP to retrieve back the data we stored, it display correctly as long as we set the charset to UTF-8. But if we didn't set it to UTF-8, SQL Server 2008 and our PHP and also our Web Applications display this data [mᴧŋasaˀ].
Hope someone can give us some more information what we did wrong.
Thanks a lot...
- Editado Allan Ortega jueves, 15 de marzo de 2012 10:57
- Editado Allan Ortega jueves, 15 de marzo de 2012 10:58
Todas las respuestas
-
jueves, 15 de marzo de 2012 12:39
Hi Allan
It looks like you have created a tab-separated CSV file from Excel?
Previously you said "Table collation is "SQL_Latin1_General_CP1_CI_AS" and Column is NVARCHAR." Your result looks like it isn't NVARCHAR, and I notice you have removed that information from this post.
Here are some basic tips:
- You stated your text file is UTF-8 encoded. If it isn't you must convert it to UTF-8 before saving.
- Your database column 'Fonetik' *must* be of type NVARCHAR - any client that wants to read this data will not be able to otherwise, like you're seeing in Sharepoint.
- Don't call 'mb_internal_encoding()', it's not needed in your case.
- To read a CSV file in PHP use 'fgetcsv()'. It's great.
- Your php has a naming error, but this is all you need:
// $sqlparam= array($Fonetik, SQLSRV_PARAM_IN, SQLSRV_PHPTYPE_STRING('UTF-8');
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery, array($Fonetik));
Assuming your column is NVARCHAR, this works:
// Don't need: mb_internal_encoding( 'UTF-8' ); $file_handle = fopen ($file, "r"); $text = fgetcsv($file_handle, 4096, "\t"); $Fonetik = $text[0]; // or... $Fonetic = mb_convert_encoding($text[0], 'UTF-8', 'whatever'); fclose($file_handle); $connectionInfo = array("UID"=>$username, "PWD"=>$password, "Database"=>$db, "CharacterSet" => "UTF-8"); $sqlconnect = sqlsrv_connect($server, $connectionInfo); $sqlquery = "INSERT INTO $table (Fonetik) VALUES (?)"; $sqlresult = sqlsrv_query($sqlconnect, $sqlquery, array($Fonetik));If your column is VARCHAR SQL_Latin1_General_CP1_CI_AS, and your file is UTF-8, you must convert your text in PHP into characters it likes. E.g.
$file_handle = fopen ($file, "r"); $text = fgetcsv($file_handle, 4096, "\t"); $Fonetik = iconv('UTF-8', 'Windows-1252', $text[0]); // might fail! fclose($file_handle); $connectionInfo = array("UID"=>$username, "PWD"=>$password, "Database"=>$db, "CharacterSet" => SQLSRV_ENC_CHAR); $sqlconnect = sqlsrv_connect($server, $connectionInfo); $sqlquery = "INSERT INTO $table (Fonetik) VALUES (?)"; $sqlresult = sqlsrv_query($sqlconnect, $sqlquery, array($Fonetik));
Rob
- Editado Robert Johnson jueves, 15 de marzo de 2012 12:48 Strike tag deprecated
- Editado Robert Johnson jueves, 15 de marzo de 2012 12:52
- Marcado como respuesta Allan Ortega sábado, 17 de marzo de 2012 2:40
-
jueves, 15 de marzo de 2012 13:47
Hello Robert,
Thanks again for the reply, it is i'm desperate now, since our last option is to insert all the data from Excel file then edit manually the Fonetik special character on our web applications sharepoint. Which is for us very time consuming than if we can able to successfully insert the data.
Yes, i tried the mb_convert_encoding when i used the CSV comma delimited then used fgetcsv. But soon realize that other fields in the excel actually uses a comma. So i opted for the TEXT UNICODE TAB DELIMITED. Then in notepad i save as again to TEXT UNICODE UTF-8. Now i don't have to used mb_convert_encoding. The result of both is correct and now i can read those field with comma.
The PHP retrieve the data in TEXT UNICODE UTF-8 file correctly and only need to do is set the CharSet to UTF-8 then the data in FONETIK fields display in PHP correctly.
Yes, again on the FONETIK fields in SQL Server 2008, it is NVARCHAR and the TABLE is "SQL_Latin1_General_CP1_CI_AS". I tried the ICONV and MB_CONVERT_ENCODING before when i'm doing some test with FONETIK special character. Both display the same result. Tried several kind of unicode to convert only ends up with different character and sometimes chinese character.
i will try this commands:
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery, array($Fonetik));
is this correct, because i have other fields needs to insert?
$sqlquery = "INSERT INTO $table (Noun, Fonetik, Definition) VALUES (?, ?, ?)"; $sqlresult = sqlsrv_query($sqlconnect, $sqlquery, array($Noun, $Fonetik, $Definition));
Thank you so much. I do really appreciate it.
-
jueves, 15 de marzo de 2012 14:52
Great.
The problem I have is that your code should have worked from the beginning, there is nothing wrong with it. Now that your input file is UTF-8, there is no reason it shouldn't work.
Anyway, here's a bigger example, designed to run as a web page (hence 'header()' statement at the top), but will also work as cli.
I have not tested this, it's all theory:
<?php header('Content-Type: text/plain; charset=UTF-8'); echo "Converting file...\r\n"; $rownum = 0; $inserted = 0; $file_handle = fopen ($file, "r"); $connectionInfo = array("UID"=>$username, "PWD"=>$password, "Database"=>$db, "CharacterSet" => "UTF-8"); if ( ($sqlconnect = sqlsrv_connect($server, $connectionInfo)) ) { $sqlquery = "INSERT INTO $table (Noun, Fonetik, Definition) VALUES (?,?,?)"; $params = array(&$Noun, &$Fonetik, &$Definition); if ( ($stmt = sqlsrv_prepare($sqlconnect, $sqlquery, $params)) ) { while ( ($text = fgetcsv($file_handle, 4096, "\t")) ) { ++$rownum; if ( count($text) > 2 ) { $Noun = $text[0]; $Fonetik = $text[1]; $Definition = $text[2]; if ( !sqlsrv_execute($stmt) ) { echo "SQL Server insert error at row $rownum: " . print_r(sqlsrv_errors(), true); } else { $inserted += sqlsrv_rows_affected($stmt); } } else { echo "Bad row in file : $rownum\r\n"; } } if ( $text === false ) { echo "End of file.\r\n"; } else { echo "Unexpected result from fgetcsv at line $rownum.\r\n"; } } else { echo "SQL Server prepare error: " . print_r(sqlsrv_errors(), true); } } else { echo "SQL Server connect error: " . print_r(sqlsrv_errors(), true); } fclose($file_handle); echo "End of script (Rows:$rownum, Inserted:$inserted)\r\n"; ?>Rob
- Marcado como respuesta Allan Ortega sábado, 17 de marzo de 2012 2:40
-
jueves, 15 de marzo de 2012 19:09
I thought about this and decided to try it out for myself. It's easy for things to go wrong when converting the input-file into the correct encoding.
This script can run with only a few changes as long as you have a SQL Server and a database to test it on.
Save a text file as Unicode from Excel and use it in this script - it will be converted automatically, and it works:
<?php header('Content-Type: text/plain; charset=UTF-8'); $file='.\tabcsv1.txt'; // save any tab-delimited text file from Excel with at least 3 columns and put it in your script folder. $db = 'tests'; // Put the name of your database $table = 'codes'; // table name - will get created automatically if it's not there $server = '.\SQLEXPRESS'; // SQL Server $username = ''; // leave blank to use trusted authentication. $password = ''; $rownum = 0; $inserted = 0; function detect_encoding($contents) { if ( substr($contents, 0, 3) == "\xEF\xBB\xBF" ) return 'UTF-8'; switch ( substr($contents, 0, 4) ) { case "\x00\x00\xFE\xFF" : return 'UTF-32BE'; case "\xFF\xFE\x00\x00" : return 'UTF-32LE'; } switch ( substr($contents, 0, 2) ) { case "\xFE\xFF" : return 'UTF-16BE'; case "\xFF\xFE" : return 'UTF-16LE'; } // now detect sbcs encodings. UTF-8 repeated because file might not have BOM marker. // Bear in mind that mb_detect_encoding is almost useless, it cannot detect // Windows ANSI, therefore no point placing iso-8859-1 in the list, it will // often be a false positive on Windows. if ( ($rv = mb_detect_encoding($contents, "UTF-8", true)) ) return $rv; return 'Windows-1252'; // ANSI - my computer's default, used by Excel. } echo "Converting file: $file...\r\n"; if ( ($contents = file_get_contents($file)) ) { $fenc = detect_encoding($contents); if ( $fenc ) { echo "File encoding is $fenc\r\n"; if ( strcasecmp($fenc, 'UTF-8') ) { echo "Converting the file to UTF-8 encoding..."; if ( ($contents = mb_convert_encoding($contents, 'UTF-8', $fenc)) ) { $pi = pathinfo($file); if ( $pi['dirname'] ) $pi['dirname'].='\\'; $converted_file = "{$pi['dirname']}{$pi['filename']}_converted_from_{$fenc}.{$pi['extension']}"; if ( file_put_contents($converted_file, $contents) ) { $file = $converted_file; echo "Converted " . strlen($contents) . " bytes.\r\n"; } else { echo "File conversion failed.\r\n"; } } else { echo "Character conversion failed.\r\n"; } } } else { echo "Failed to detect file encoding. This script will probably fail.\r\n"; } } $file_handle = fopen ($file, "r"); $connectionInfo = array("Database"=>$db, "CharacterSet" => "UTF-8"); // if user name is blank, use trusted connection to SQL Server (Windows login) if ( $username ) { // use SQL Server login $connectionInfo["UID"] = $username; $connectionInfo["PWD"] = $password; } if ( ($sqlconnect = sqlsrv_connect($server, $connectionInfo)) ) { /** Create a table for our test **/ $create = <<<SQL IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[$table]') AND type in (N'U')) BEGIN CREATE TABLE [dbo].[$table]( [Noun] [nvarchar](20) NOT NULL , [Fonetik] [nvarchar] (50) not null, [Definition] nvarchar(260) NOT NULL) END; TRUNCATE TABLE $table; SQL; if ( ($stmt=sqlsrv_query($sqlconnect, $create)) ) { while ( sqlsrv_next_result($stmt) ) ; sqlsrv_free_stmt($stmt); } else { echo "SQL Server error creatiing table $table: " . print_r(sqlsrv_errors(), true); } $sqlquery = "INSERT INTO $table (Noun, Fonetik, Definition) VALUES (?,?,?)"; $params = array(&$Noun, &$Fonetik, &$Definition); if ( ($stmt = sqlsrv_prepare($sqlconnect, $sqlquery, $params)) ) { while ( ($text = fgetcsv($file_handle, 4096, "\t")) ) { ++$rownum; if ( count($text) > 2 ) { $Noun = $text[0]; $Fonetik = $text[1]; $Definition = $text[2]; if ( !sqlsrv_execute($stmt) ) { echo "SQL Server insert error at row $rownum: " . print_r(sqlsrv_errors(), true); } else { $inserted += sqlsrv_rows_affected($stmt); } } else { echo "Bad row in file : $rownum\r\n"; } } if ( $text === false ) { echo "End of file.\r\n"; } else { echo "Unexpected result from fgetcsv at line $rownum.\r\n"; } } else { echo "SQL Server prepare error: " . print_r(sqlsrv_errors(), true); } } else { echo "SQL Server connect error: " . print_r(sqlsrv_errors(), true); } fclose($file_handle); echo "End of script (Rows:$rownum, Inserted:$inserted)\r\n"; ?>Rob
- Marcado como respuesta Allan Ortega sábado, 17 de marzo de 2012 2:40
-
viernes, 16 de marzo de 2012 3:49
Hi Robert,
Sorry for the late reply, will try your suggestion later, unfortunately i don't have access to our system right now.
In the meantime i will show you the file i used. the text file from excel save as TEXT UNICODE (*.TXT). Then open it in notepad then save as again to TEXT DOCUMENTS (*.TXT) ENCODING: UTF-8.
Here is the current PHP file code:
<?php
header( 'Content-Type: text/html; charset=UTF-8' );
$file_handle = fopen ($file, "r");
$text = fgets($file_handle, 4096);
$text = explode("\t", $text);
$Type = bin2hex($text[0]);
$Lema = bin2hex($text[1]);
$Fonetik = bin2hex($text[2]);
$Malayu = bin2hex($text[3]);
echo "$Type -- $Lema -- $Fonetik -- $Malayu<br />";
while (!feof($file_handle) )
{
$text = fgets($file_handle, 4096);
$text = explode("\t", $text);
$Type = $text[0];
$Lema = $text[1];
$Fonetik = $text[2];
$Malayu = $text[3];
echo "From File: " . bin2hex($Lema) . " -- $Lema<br />";
echo "From File: " . bin2hex($Fonetik) . " -- $Fonetik<br />";
echo "From File: " . bin2hex($Malayu) . " -- $Malayu<br />";
$Lema = utf8_decode($Lema);
$Lema = str_replace("'","''",$Lema);
$Malayu = utf8_decode($Malayu);
$Malayu = str_replace("'","''",$Malayu);
$connectionInfo = array("UID"=>$username, "PWD"=>$password, "Database"=>$db, "CharacterSet" => "UTF-8");
$sqlconnect = sqlsrv_connect($server, $connectionInfo);
$sqlquery = "INSERT INTO $table (Lema, Fonetik, Malayu) VALUES (?,?,?)";
$sqlparam = array($Lema, $Fonetik, $Malayu);
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery, $sqlparam);
}
fclose($file_handle);
sqlsrv_free_stmt($sqlresult);
sqlsrv_close ($sqlconnect);
?>
Text File content:
Type Lema Fonetik Malayu
1 asa'; mangasa' [mᴧŋasaˀ] menyiang, (ikan)
PHP result (charset to UTF-8):
efbbbf54797065 -- 4c656d61 -- 466f6e6574696b -- 4d616c6179750d0a
From File: 617361273b206d616e6761736127 -- asa'; mangasa'
From File: 5b6de1b4a7c58b617361cb805d -- [mᴧŋasaˀ]
From File: 6d656e7969616e672c2028696b616e29 -- menyiang, (ikan)
PHP result (no charset):
efbbbf54797065 -- 4c656d61 -- 466f6e6574696b -- 4d616c6179750d0a
From Trim: 617361273b206d616e6761736127 -- asa'; mangasa'
From Trim: 5b6de1b4a7c58b617361cb805d -- [mᴧŋasaˀ]
From Trim: 6d656e7969616e672c2028696b616e29 -- menyiang, (ikan)
But on our Web Applications Sharepoint and SQL Server Studio Management the data in FONETIK is this [mᴧŋasaˀ].
thank you very much. will try you suggestion code tomorrow and will let you know the outcome. -
viernes, 16 de marzo de 2012 8:26
Allan, your code changes every time you post it.
'utf8_decode' converts from utf-8 to iso-8859-1. Why are you doing that?
Then you're tsql-escaping the text with 'str_replace', and also passing it as a parameter to sqlsrv_query. That's also wrong.
I don't have any more time at the moment, but the example posted above should work for you.
Rob
-
viernes, 16 de marzo de 2012 11:53Hello Robert,
Very sorry for the confusion. The PHP code i posted is the TEST.PHP i created if i successfully store data from TEXT UNICODE UTF-8 file. I used only one data.
The other fields is just ordinary text without any special character but some data has ' on it. which i need to escape by doing replacing the single ' to double ''. That is why i used utf8_decode then str_replace for that since it was a UTF-8 string. But the only fields we are after is the FONETIK which have special character.
Then if the FONETIK is successfully stored in SQL Server 2008. Then we will store all the remaining data.
Very sorry again for any confusion. Thank you very much. -
sábado, 17 de marzo de 2012 2:40
Hello Robert,
Thank you very much, ITS WORK... the only changes made to my PHP is this...
from:$sqlquery = "INSERT INTO $table (Lema, Fonetik, Malayu) VALUES ('$Lema', '$Fonetik', '$Malayu')";
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery);
to:
$sqlquery = "INSERT INTO $table (Lema, Fonetik, Malayu) VALUES (?,?,?)";
$sqlparam = array($Lema, $Fonetik, $Malayu);
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery, $sqlparam);
Thank you so much. FONETIK fields show exactly the way in EXCEL. Web Applications Sharepoint and even the SQL 2008 Server shows correctly.
I know i said thanks many times but, THANK YOU.
- Editado Allan Ortega sábado, 17 de marzo de 2012 2:42
- Editado Allan Ortega sábado, 17 de marzo de 2012 2:42
-
sábado, 17 de marzo de 2012 11:12
Thanks Allan, glad (and relieved!) you have succeeded.
Some more new code there...
from: $sqlquery = "INSERT INTO $table (Lema, Fonetik, Malayu) VALUES ('$Lema', '$Fonetik', '$Malayu')";
You are a wonderful man, and a little bit mad!! :-)
Rob
-
sábado, 17 de marzo de 2012 17:33
Hello Roberts,
I am very sorry for the headache and lot of confusion. I do really appreciated what you've done and given lot of patience.
from: $sqlquery = "INSERT INTO $table (Lema, Fonetik, Malayu) VALUES ('$Lema', '$Fonetik', '$Malayu')";
that was the code i used to insert the data. From my experience with MYSQL and MSSQL commands. Then you introduce me the new command using ARRAY in SQLSRV. Which is:
$sqlresult = sqlsrv_query($sqlconnect, $sqlquery, array($Noun, $Fonetik, $Definition));
Then i do some readings with example from SQLSRV. And i think that was the mistake i made using the sample commands from MYSQL and MSSQL.
I do learned some new stuff. Thanks to you so much for the patience and helps.

