txt-file is UTF8 or DOS.

J

Joergen Bondesen

Hi NG

I am using Excel VBA to open a txt-file and evaluate data.

Sometimes I receive a txt-file in UTF8 and not DOS. This is a big probleme
for me.

Is there a VBA way to 'controle' the txt-file and give me feedback for UTF8
and DOS.

Is it with VBA possible to converte the file from UTF8 to DOS. Today I am
using UltraEdit.
 
M

Michel Pierron

Hi Joergen,
Something like:

Option Explicit
Private Declare Function MultiByteToWideChar Lib "kernel32" _
(ByVal CodePage As Long, ByVal dwFlags As Long _
, ByVal lpMultiByteStr As String, ByVal cchMultiByte As Long _
, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Declare Function WideCharToMultiByte Lib "kernel32" _
(ByVal CodePage As Long, ByVal dwFlags As Long _
, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long _
, ByVal lpMultiByteStr As String, ByVal cchMultiByte As Long _
, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long
Private Const CP_UTF8 = 65001

Sub Joergen()
Const sPath$ = "Path of the file to be tested\"
Const sFile$ = "The name of the text file to be tested.txt"
Const tFile$ = "My decoded file.txt"
MsgBox Utf8Encoding(sPath & sFile, sPath & tFile), 64
End Sub

' Returns True if encoding = utf-8
' If encoding utf-8 then convert to ANSI
' and save under SaveAs (if SaveAs = ""
' then replace txtFile).
Private Function Utf8Encoding(txtFile As String _
, Optional SaveAs As String = "") As Boolean
If InStr(1, txtFile, ".txt", 1) = 0 Then Exit Function
If Dir(txtFile) = "" Then Exit Function
Dim i&, b() As Byte
Dim Buffer$, f%: f = FreeFile
Open txtFile For Binary Access Read As #f
Buffer = String(LOF(f), Chr(0))
Get #f, , Buffer
Close #f
For i = 1 To 3 ' UTF-8 BOM = EF BB BF
If Asc(Mid(Buffer, i, 1)) <> Choose(i, 239, 187, 191) _
Then Exit Function
Next i
Buffer = Mid$(Buffer, 4)
Dim s As Variant: s = Split(Buffer, vbCrLf)
If SaveAs = "" Then SaveAs = txtFile
On Error Resume Next
Kill SaveAs ' Delete any existing file.
On Error GoTo 0
f = FreeFile ' Save the file.
Open SaveAs For Output As #f
For i = 0 To UBound(s)
Print #f, UTF8ToA(s(i))
Next i
Close f
Utf8Encoding = True
End Function

Private Function UTF8ToA(ByVal wText As String) As String
Dim vNeeded&, vSize&: vSize = Len(wText)
vNeeded = MultiByteToWideChar(CP_UTF8, 0, wText, vSize, 0, 0)
UTF8ToA = String(vNeeded, 0)
MultiByteToWideChar CP_UTF8, 0, wText, vSize, StrPtr(UTF8ToA), vNeeded
End Function

Regards,
MP
 
J

Joergen Bondesen

Hi Michel

Thanks, I am impressed.
Look below where I have added my comments starting with '//, please.
I do hope you can help me.


Option Explicit

Private Declare Function MultiByteToWideChar Lib "kernel32" _
(ByVal CodePage As Long, ByVal dwFlags As Long _
, ByVal lpMultiByteStr As String, ByVal cchMultiByte As Long _
, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long


Private Declare Function WideCharToMultiByte Lib "kernel32" _
(ByVal CodePage As Long, ByVal dwFlags As Long _
, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long _
, ByVal lpMultiByteStr As String, ByVal cchMultiByte As Long _
, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

Private Const CP_UTF8 = 65001

Sub Joergen()
'// My consts
Const sPath$ = "C:\"
Const sFile$ = "UTF8toDOS.txt"
Const tFile$ = "My decoded file.txt"

MsgBox Utf8Encoding(sPath & sFile, sPath & tFile), 64
End Sub


Private Function UTF8ToA(ByVal wText As String) As String
Dim vNeeded&, vSize&: vSize = Len(wText)
vNeeded = MultiByteToWideChar(CP_UTF8, 0, wText, vSize, 0, 0)
UTF8ToA = String(vNeeded, 0)
MultiByteToWideChar CP_UTF8, 0, wText, vSize, StrPtr(UTF8ToA), vNeeded
End Function


' Returns True if encoding = utf-8
' If encoding utf-8 then convert to ANSI
' and save under SaveAs (if SaveAs = ""
' then replace txtFile).
Private Function Utf8Encoding(txtFile As String _
, Optional SaveAs As String = "") As Boolean

If InStr(1, txtFile, ".txt", 1) = 0 Then Exit Function

If Dir(txtFile) = "" Then Exit Function

Dim i&, b() As Byte
Dim Buffer$, f%: f = FreeFile

Open txtFile For Binary Access Read As #f
Buffer = String(LOF(f), Chr(0))
Get #f, , Buffer
Close #f

'// When I read the file in UltraEdit (texteditor) in
'// hex mode, the 2 first sign is '255, 254' "disappears".
'// When I converte the file to DOS, these 2 sign
'// so I conclude (prehaps wrong) I can determine if I
'// have a unix file.

'// I also think that: 'Get #f, , Buffer' give me a "DOS"
'// line so below is not the way. Can you help me further.

' For i = 1 To 3 ' UTF-8 BOM = EF BB BF
' If Asc(Mid(Buffer, i, 1)) <> Choose(i, 239, 187, 191) _
' Then Exit Function
' Next i


'// This cut the first 3 sign in firste line so therefore
'// have I change 4 to 1.
'Buffer = Mid$(Buffer, 4)
Buffer = Mid$(Buffer, 1)

Dim s As Variant: s = Split(Buffer, vbCrLf)
If SaveAs = "" Then SaveAs = txtFile

On Error Resume Next
Kill SaveAs ' Delete any existing file.
On Error GoTo 0

f = FreeFile ' Save the file.
Open SaveAs For Output As #f

For i = 0 To UBound(s)
'// Avoid "empty" line after last record
Dim NewLine As String
NewLine = UTF8ToA(s(i))

If NewLine <> vbNullString Then
Print #f, NewLine
End If
Next i
' For i = 0 To UBound(s)
' Print #f, UTF8ToA(s(i))
' Next i
Close f

Utf8Encoding = True
End Function
 
M

Michel Pierron

Hi Joergen,
If the first two signs of your file correspond to 255 and 254, it is that
your file is not encoded in UTF-8, but in UTF16 litle endian and it is much
simpler to decode it:

Sub Joergen_2()
'// My consts
Const sPath$ = "C:\"
Const sFile$ = "UTF16toDOS.txt"
Const tFile$ = "My_Result_file.txt""

If GetEncoding(sPath & sFile) <> "UTF16L" Then Exit Sub

' Convert Unicode to Ascii
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Const ModeAscii = 0, ModeUnicode = -1
Dim fso As Object, f_in As Object, f_out As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Set f_in = fso_OpenTextFile(sPath & sFile, ForReading, , ModeUnicode)
Set f_out = fso_OpenTextFile(sPath & tFile, ForWriting, True, ModeAscii)
Do Until f_in.AtEndOfStream
f_out.Write f_in.Read(1)
Loop
f_in.Close: f_out.Close
Set f_out = Nothing: Set f_in = Nothing: Set fso = Nothing
End Sub

Private Function GetEncoding(txtFile As String) As String
Dim b(1) As Byte, f%: f = FreeFile
Open txtFile For Binary Access Read As #f
Get #f, , b
Close #f
If b(0) = &HEF And b(1) = &HBB Then
GetEncoding = "UTF-8"
ElseIf b(0) = &HFF And b(1) = &HFE Then
' Litle endian unicode (ucs-2le, ucs-4le, and ucs-16le)
GetEncoding = "UTF16L"
ElseIf b(0) = &HFE And b(1) = &HFF Then
' Big endian unicode (utf-16 and ucs-2)
GetEncoding = "UTF16B"
Else
GetEncoding = "ANSI"
End If
End Function

Regards,
MP
 
J

Joergen Bondesen

Hi Michel

Thanks, but it do not function.

If i read my inputfile in Notepad the 2 first sign is "I (34 and 73)

If I read the file in UltraEdit in Hex mode if says: ÿþ"I or (FF FE 22
00 49)

Private Function GetEncoding(txtFile As String) As String

b(0) = 34
b(1)= 73

so I do not read in Binary or ??

*************************

if I use Reference: Microsoft Script Runtime and forcing: GetEncoding =
"UTF16L", I get error at: f_out.Write f_in.Read(1) runtime error '5'

--
Best regards
Joergen Bondesen


Michel Pierron said:
Hi Joergen,
If the first two signs of your file correspond to 255 and 254, it is that
your file is not encoded in UTF-8, but in UTF16 litle endian and it is
much simpler to decode it:

Sub Joergen_2()
'// My consts
Const sPath$ = "C:\"
Const sFile$ = "UTF16toDOS.txt"
Const tFile$ = "My_Result_file.txt""

If GetEncoding(sPath & sFile) <> "UTF16L" Then Exit Sub

' Convert Unicode to Ascii
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Const ModeAscii = 0, ModeUnicode = -1
Dim fso As Object, f_in As Object, f_out As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Set f_in = fso_OpenTextFile(sPath & sFile, ForReading, , ModeUnicode)
Set f_out = fso_OpenTextFile(sPath & tFile, ForWriting, True, ModeAscii)
Do Until f_in.AtEndOfStream
f_out.Write f_in.Read(1)
Loop
f_in.Close: f_out.Close
Set f_out = Nothing: Set f_in = Nothing: Set fso = Nothing
End Sub

Private Function GetEncoding(txtFile As String) As String
Dim b(1) As Byte, f%: f = FreeFile
Open txtFile For Binary Access Read As #f
Get #f, , b
Close #f
If b(0) = &HEF And b(1) = &HBB Then
GetEncoding = "UTF-8"
ElseIf b(0) = &HFF And b(1) = &HFE Then
' Litle endian unicode (ucs-2le, ucs-4le, and ucs-16le)
GetEncoding = "UTF16L"
ElseIf b(0) = &HFE And b(1) = &HFF Then
' Big endian unicode (utf-16 and ucs-2)
GetEncoding = "UTF16B"
Else
GetEncoding = "ANSI"
End If
End Function

Regards,
MP
 
M

Michel Pierron

Hi Joergen,

Sometimes, the programmer does not comply with the rules and forgets the
heading of the file.
It is apparement the case as the GetEncoding procedure shows it (b(0) = 34
and b(1) = 73).
However, the detail of UltraEdit editor shows well that it acts of a file to
the format unicode and it seems that in this case, the corresponding heading
is automatically added.

NotePad recognizes automatically a file of the type ANSI, utf-8 or unicode
and never indicates the heading if it is present. With NotePad, You have
also the possibility of saving the file running in a different format.

Here, another method to test if the file is unicode format:

Option Explicit
Private Declare Function IsTextUnicode Lib "advapi32" _
(ByVal lpBuffer As String, ByVal cb As Long, lpi As Long) As Long

' Not need reference to Microsoft Script Runtime with this procedure
Sub Joergen_3()
'// My consts
Const sPath$ = "C:\"
Const sFile$ = "UTF16toDOS.txt"
Const tFile$ = "My_Result_file.txt"
If Not IsUnicode(sPath & sFile) Then Exit Sub

' Convert Unicode to Ascii
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Const ModeAscii = 0, ModeUnicode = -1
Dim fso As Object, f_in As Object, f_out As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Set f_in = fso_OpenTextFile(sPath & sFile, ForReading, , ModeUnicode)
Set f_out = fso_OpenTextFile(sPath & tFile, ForWriting, True, ModeAscii)
Do Until f_in.AtEndOfStream
f_out.Write f_in.Read(1)
Loop
f_in.Close: f_out.Close
Set f_out = Nothing: Set f_in = Nothing: Set fso = Nothing

MsgBox "The unicode file " & sFile & vbLf _
& "was converted into ansi file " & tFile & " !", 64
End Sub

' Returns True if sBuffer evaluates to a Unicode string
Private Function IsUnicode(txtFile As String) As Boolean
Dim Buffer$, f%: f = FreeFile
Open txtFile For Binary Access Read As #f
Buffer = String(LOF(f), Chr(0))
Get #f, , Buffer
Close #f
IsUnicode = IsTextUnicode(ByVal Buffer, Len(Buffer), &HF)
If Not IsUnicode Then MsgBox "It is not a unicode file !", 64
End Function

Regards,
MP
 
J

Joergen Bondesen

Hi Michel

Your code can run after I placet below in the top of the Module.

Private Declare Function IsTextUnicode Lib "advapi32" _
(ByVal lpBuffer As String, ByVal cb As Long, lpi As Long) As Long


I still get the message: "It is not an unicode file !" despite I can se it
is an unicode file.

Perhaps, if you think it is an good idea, I can send you the txt-file and my
spreadsheet, so we can make out if we are misunderstanding each other?
 
M

Michel Pierron

No problem Joergen, send to me your files if you want that I look at what
does not go.

Best regards,
Michel
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top