Aziz ur Rahman

Random Thoughts


System.Net.HttpWebRequest - Arabic Data

A few days ago I had a task that required fetching and parsing data from the HTML pages of a site. I first tried the System.Net.HttpWebRequest class to make the request and read the data.

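' Url is the address of the page to fetch
Dim objRequest As System.Net.HttpWebRequest = CType(System.Net.WebRequest.Create(Url), System.Net.HttpWebRequest)
Dim result As String
objRequest.Method = "GET"

Dim objResponse As System.Net.HttpWebResponse = CType(objRequest.GetResponse(), System.Net.HttpWebResponse)
Dim sr As System.IO.StreamReader
sr = New System.IO.StreamReader(objResponse.GetResponseStream())
result = sr.ReadToEnd()
' Release the reader and the connection before returning
sr.Close()
objResponse.Close()
Return result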


It worked fine, but the Arabic data came back corrupted (the site was in Arabic). I experimented with the stream classes and found the solution: the encoding has to be specified when reading the response stream.

sr = New System.IO.StreamReader(objResponse.GetResponseStream(), System.Text.Encoding.UTF8)
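
Hardcoding UTF-8 works when the site really serves UTF-8. For pages in another charset, one option is to pick the encoding from the charset name the server reports, for example the HttpWebResponse.CharacterSet property. The following is only a minimal sketch of that idea: ResolveEncoding and the module name are hypothetical helpers, not part of the code above, and falling back to UTF-8 is an assumption. The Main routine also shows how decoding UTF-8 bytes with the wrong encoding corrupts Arabic text.

```vbnet
Imports System
Imports System.Text

Module EncodingSketch
    ' Hypothetical helper: maps a charset name (for example the value of
    ' HttpWebResponse.CharacterSet) to an Encoding.
    ' Assumption: fall back to UTF-8 when the name is missing or unknown.
    Function ResolveEncoding(ByVal charsetName As String) As Encoding
        If String.IsNullOrEmpty(charsetName) Then
            Return Encoding.UTF8
        End If
        Try
            Return Encoding.GetEncoding(charsetName)
        Catch ex As ArgumentException
            ' The name is not a recognised encoding name
            Return Encoding.UTF8
        Catch ex As NotSupportedException
            ' The encoding exists but is not available on this platform
            Return Encoding.UTF8
        End Try
    End Function

    Sub Main()
        Dim original As String = "مرحبا"
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(original)

        ' Decoding the UTF-8 bytes with a single-byte encoding corrupts the text
        Dim wrong As String = Encoding.GetEncoding("iso-8859-1").GetString(bytes)
        ' Decoding with the encoding that matches the bytes restores it
        Dim right As String = ResolveEncoding("utf-8").GetString(bytes)

        Console.WriteLine(wrong = original)
        Console.WriteLine(right = original)
    End Sub
End Module
```

With such a helper, the reader could be created as New System.IO.StreamReader(objResponse.GetResponseStream(), ResolveEncoding(objResponse.CharacterSet)). As far as I know, CharacterSet reports ISO-8859-1 when the server sends no charset at all, so pages that declare their encoding only in a meta tag would still need separate handling.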

After getting the data in the correct format, I tried to load the result with XmlDocument, but there was another problem: XmlDocument threw exceptions and could not load it. After some checking I found that this was caused by HTML that is not well-formed XML: tags without closing tags, e.g. <br>, <hr> and <Img>, and attributes without values, e.g. nowrap. I then applied some formatting to the result, like this:

strMatter = strMatter.Replace("<BR>", "")
strMatter = strMatter.Replace("nowrap", "")
strMatter = strMatter.Replace("pointer;"">", "pointer;""></IMG>")
strMatter = strMatter.Replace("pointer;"" >", "pointer;""></IMG>")
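
The replacements above are tied to the exact markup of that one site. A slightly more general variant, sketched here under stated assumptions, rewrites <br>, <hr> and <img> tags as self-closing with a regular expression so that XmlDocument accepts them. CloseVoidTags and the module name are hypothetical, the tag list is only the three tags mentioned above, and the pattern assumes no attribute value contains a ">" character. It does not handle value-less attributes such as nowrap, which would still have to be removed separately.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports System.Xml

Module VoidTagSketch
    ' Hypothetical helper: rewrites <br>, <hr> and <img ...> (any case,
    ' with or without a trailing slash) as self-closing tags.
    ' Assumption: attribute values do not contain the ">" character.
    Function CloseVoidTags(ByVal html As String) As String
        Dim pattern As String = "<(br|hr|img)\b([^>]*?)\s*/?>"
        Return Regex.Replace(html, pattern, "<$1$2 />", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim fixedMarkup As String = CloseVoidTags("<p>one<BR>two<img src=""a.gif""></p>")
        Console.WriteLine(fixedMarkup)

        ' The fragment is now well-formed, so XmlDocument loads it
        Dim doc As New XmlDocument()
        doc.LoadXml(fixedMarkup)
        Console.WriteLine(doc.DocumentElement.Name)
    End Sub
End Module
```

This keeps the line breaks and images in the document instead of deleting them, which may or may not be what is wanted before saving to the database.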

After that I successfully parsed the data and saved it in the database. Is there a class for HTML, corresponding to XmlDocument for XML, that can easily load and parse HTML?

Posted: Saturday, February 4, 2006 9:45 AM by aziz
