How Can I Strip HTML From Text in .NET?

June 25, 2024

I have an ASP.NET web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database. On the server, I would like to strip the HTML from the t

Solution 1:

I downloaded the HtmlAgilityPack and created this function:

// needs: using System.Text.RegularExpressions; using System.Web; plus the HtmlAgilityPack library
string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">", "> ");

    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // strip html decoded text from html
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}

Solution 2:

Take a look at this article: "Strip HTML tags from a string using regular expressions".

Solution 3:

Here's the link to Jeff Atwood's RefactorMe code for his Sanitize HTML method.

Solution 4:

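string str;
using (TextReader tr = new StreamReader(@"Filepath"))   // dispose the reader when done
    str = tr.ReadToEnd();
// remove anything that looks like a tag, even one that spans several lines
str = Regex.Replace(str, "<(.|\n)*?>", string.Empty);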

But you need to reference a namespace, i.e.:

System.Text.RegularExpressions

Just adapt this logic for your website.
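
For a quick check without a file, the same pattern can be run against a string in memory. This is only a sketch: the class name StripDemo and the sample markup are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Same pattern as above: lazily match "<", any characters (newlines included), then ">"
    public static string Strip(string html)
    {
        return Regex.Replace(html, "<(.|\n)*?>", string.Empty);
    }

    public static void Main()
    {
        string html = "<p>Hello <b>world</b></p>";
        Console.WriteLine(Strip(html));   // prints "Hello world"
    }
}
```

Note that the pattern removes tags only: entities such as &amp; stay in the text, and a literal < in the content can swallow everything up to the next >.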

Solution 5:

If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:

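public static string StripTags(string value)
{
    if (value == null)
        return string.Empty;

    // replace character entities such as &nbsp; or &amp; with a space
    string pattern = @"&.{1,8};";
    value = Regex.Replace(value, pattern, " ");

    // remove anything between < and >, including tags that span several lines
    pattern = @"<(.|\n)*?>";
    return Regex.Replace(value, pattern, string.Empty);
}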

It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
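
The extra indexing step mentioned at the start of this answer (ignoring stop-words and dropping words shorter than 3 characters) might look roughly like this. It's a sketch, not part of the original code: the IndexText class, the KeepIndexableWords name and the tiny stop-word list are all made up for illustration, and it assumes the tags have already been stripped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IndexText
{
    // Hypothetical, deliberately tiny stop-word list; a real index would use a fuller one
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "the", "and", "for", "with", "that", "this" },
        StringComparer.OrdinalIgnoreCase);

    // Compiled once and reused, along the lines of the compiled reg-ex idea above
    private static readonly Regex WordPattern = new Regex(@"\w+", RegexOptions.Compiled);

    // Keeps words of at least minLength characters that are not stop-words
    public static string KeepIndexableWords(string text, int minLength = 3)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        IEnumerable<string> words = WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => w.Length >= minLength && !StopWords.Contains(w));

        return string.Join(" ", words);
    }
}
```

For example, IndexText.KeepIndexableWords("The cat sat on a mat with the dog") returns "cat sat mat dog".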
