Html agility pack get inner text

public virtual string InnerText { get; }

Gets the text between the start and end tags of the object. InnerText is a member of HtmlAgilityPack.HtmlNode.

Example

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/h2");

foreach (var node in htmlNodes)
{
    Console.WriteLine(node.InnerText);
}



The HTML I use is as follows:

        

In my C# code, I want to extract the following content from the markup: "Copyright © FUCHS Online Ltd, 2013. All Rights ".

I have tried the following:

   public string getvalue()
        {
            HtmlWeb web = new HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = web.Load("www.fuchsonline.com");
            var link = doc.DocumentNode.SelectNodes("//div[@id='footertext']");
            return link.ToString();
        }

This returns the type name "HtmlAgilityPack.HtmlNodeCollection" instead of the text, because ToString() is called on the node collection. How can I get the text value alone?
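One way to get the text, sketched under some assumptions: select a single node with SelectSingleNode, read its InnerText, and decode entities such as &copy; with HtmlEntity.DeEntitize. The page markup is not shown above, so the HTML string below is a hypothetical stand-in, and the class and method names (FooterTextExample, GetFooterText) are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class FooterTextExample
{
    // Returns the decoded text of the first <div id="footertext">,
    // or null when no such element exists.
    public static string GetFooterText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns one HtmlNode (or null), not a collection.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='footertext']");
        if (node == null)
        {
            return null;
        }

        // InnerText keeps entities such as &copy; as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        // Hypothetical markup; the real page content is an assumption.
        string html = "<html><body><div id='footertext'><p>Copyright &copy; FUCHS Online Ltd, 2013.</p></div></body></html>";
        Console.WriteLine(GetFooterText(html));
    }
}
```

To load a live page instead, HtmlWeb.Load needs an absolute URL including the scheme (for example "http://..."), and the null check still matters because the element may be missing.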

Input

foo bar baz

Output

foo
bar
baz

I know of htmldoc.DocumentNode.InnerText, but it returns foobarbaz. I want each piece of text separately, not all at once.

AakashM

asked Nov 15, 2010 at 8:25

XPATH is your friend :)

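HtmlDocument doc = new HtmlDocument();
// Sample markup (assumed; the original tags are not shown): an inline element splits the text into three nodes.
doc.LoadHtml(@"<p>foo <a href='#'>bar</a> baz</p>");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}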

answered Nov 21, 2010 at 9:50

Simon Mourier

var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrEmpty(text))
            sb.AppendLine(text.Trim());
    }
}

This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.

answered Nov 15, 2010 at 9:15

Dyppl

I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my needs:

// Text nodes only, skipping anything whose parent is <script> or <style>.
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes)
{
    Console.WriteLine(node.InnerText);
}

answered Sep 24, 2014 at 16:20

var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;

The specified example for html content:

foo bar baz

will produce the following output:

foo bar baz

answered Dec 12, 2014 at 10:29

Vadim Gremyachev

public string html2text(string html) {
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    // Wrap the fragment so that the //body query below always finds a node.
    doc.LoadHtml(@"<html><body>" + html + "</body></html>");
    return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}

This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).

answered Nov 11, 2015 at 16:29

Vito Gentile

https://github.com/jamietre/CsQuery

Have you tried CsQuery? Although it is no longer actively maintained, it is still my favorite for converting HTML to text. Here is a one-liner that shows how simple it is to get the text from HTML:

var text = CQ.CreateDocument(htmlText).Text();

Here's a complete console application:

using System;
using CsQuery;

public class Program
{
    public static void Main()
    {
        // Markup is an assumed stand-in; the original tags are not shown.
        var html = "<div><h1>Hello World</h1> <p>some text inside h2 tag under p tag</p></div>";
        var text = CQ.CreateDocument(html).Text();
        Console.WriteLine(text); // Output: Hello World some text inside h2 tag under p tag
    }
}

I understand that the OP asked about HtmlAgilityPack only, but CsQuery is a lesser-known option and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!

answered Oct 23, 2020 at 10:28

Sunny Sharma

I combined and adjusted some of the other answers so they work better:

var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
    // Keep only text nodes that are not inside <script> or <style>.
    if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
    {
        string text = node.InnerText?.Trim();
        // Skip empty strings and leftover tag-like fragments.
        if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
            sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
    }
}

answered Nov 3 at 12:16

Ali Yousefi