Why does Javascript hate me?

James Bengel · ‎10-15-2019

This is more of a general Javascript question (probably) than one specifically related to what happens in SN, but nothing I have found in an entire day's research has gotten me closer to an answer, so I'm wondering if anybody here has encountered (and conquered) this particular dragon.

What I am tasked with is figuring out how to remove embedded HTML tags from a field. And I'm almost there. In fact, removing tags is a relatively simple exercise (once one gets over the fear of regex), but there's a bit of wisdom I heard once that states that the stories of most personal tragedies begin with the words "I decided"... And thus it is in this story. "I decided" that it would be a good idea to replace line breaks and paragraphs with actual newlines. Which again seems like it would be a simple replacement, right? Something like this:

.replace(/<\s*\/?br\s*[\/]?>/gi, '\n')

Now I'll begin by telling you the regex works. I confirmed this by replacing the '\n' with 'line break' and with the input:

Niiiice<br/><div tabindex="-1" class="_rp_f1 _rp_g1" id="ItemHeader.ToContainer" role="heading"><span class="_rp_i1 ms-fwt-sb ms-font-color-black _rp_h1">To:</span> Somebody whose name is unimportant

that (plus some other replacements not shown here for brevity) results in the output

Niiiice line break To: Somebody whose name is unimportant

So it replaced the <br/> with the literal "line break", which tells me the search was successful. But returning to the original .replace() call, the output when I specify '\n' as the replacement string looks like:

Niiiice To: Somebody whose name is unimportant

No line break. Big frowny face.

Knowing that the script is finding the line break successfully, I'm forced to conclude that Javascript simply doesn't like outputting a '\n' as a newline. But wait, there's more. It also doesn't like any of these options -- all of which stubbornly yield the same result:

"\n",
'\r''\n',
'\r\n'
"\r\n",
'\r',
'\0x0a' (or 0d0a)
'\u000a' (or 000d 000a)
os.EOL (require os)

And given that this also happens in Node JS at the command prompt if I log output to the console, I'm starting to think this is just one more example of Javascript being Javascript, and wondering if it really matters that much if the breaks don't break. But every blog/forum/post/tutorial I run across that has an answer that sounds even remotely on point says that what I originally did should work. And yet. So before I tap out entirely I figured I'd put the question to the brethren here and see if anyone else has run up against this.

James Bengel · ‎10-21-2019

I see what did it now. It wasn't that the newlines weren't being inserted; it was that the last replacement before the trim was taking them back out again:

.replace(/\s{2,}/g, ' ');

Once the tags had been replaced by line breaks, the \s{2,} was seeing them as whitespace and pulling the line back up again, collapsing the extra whitespace into a single space. And since the <br> tag wasn't followed by a space and the </p> tag was, there appeared to be some inexplicable discrepancy between the two replacements. But it was explicable once I broke the collection of .replace() calls into separate assignments and stuck a console.log between each one so I could see what the output from each call was. With that done, I could see that the </p> DID initially convert as expected, and that somewhere below that it was "unconverted". At that point, it was just a matter of working back through the list of replacements to see which one preceded the "unconversion", and then it was obvious what had happened.
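The step-by-step logging described above can be sketched roughly as follows. This is only an illustration: traceReplacements is a hypothetical helper name, the input is a shortened version of the sample from the question, and the three-step list is an assumption standing in for the full chain, which the thread never shows. JSON.stringify is used so that a newline is printed visibly as \n instead of silently wrapping the console output.

```javascript
// Hypothetical helper: apply each [pattern, replacement] step in order and
// record the intermediate string, so the step that undoes a change stands out.
function traceReplacements(input, steps) {
  let text = input;
  const trace = [];
  for (const [pattern, replacement] of steps) {
    text = text.replace(pattern, replacement);
    // JSON.stringify renders a newline as the two visible characters \n
    trace.push(JSON.stringify(text));
    console.log(String(pattern).padEnd(26), JSON.stringify(text));
  }
  return { text, trace };
}

// Assumed stand-in for the real chain of .replace() calls
const steps = [
  [/<\s*\/?br\s*[\/]?>/gi, '\n'], // line breaks become newlines
  [/<[^>]+>/g, ' '],              // every other tag becomes a single space
  [/\s{2,}/g, ' '],               // collapses whitespace runs, newlines included
];

const result = traceReplacements(
  'Niiiice<br/><div><span>To:</span> Somebody',
  steps
);
```

In the logged output the \n is present after the first two steps and gone after the third, which points at the \s{2,} replacement as the one doing the "unconversion".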

BUT, since the /\s{2,}/ is the final replacement before the trim, and every other tag that isn't converted to a discrete value is converted to a single space, all I had to do was replace the /\s{2,}/ with /  / (two literal spaces), et voila!
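A minimal sketch of that idea, for anyone landing here later. The function name collapseSpaces is made up for illustration, and it uses / {2,}/g (two or more literal spaces) as a slight generalization of the two-space pattern. The second replacement, which drops spaces sitting right next to a newline, is an optional extra and not something from the original fix.

```javascript
// Collapse runs of literal spaces without touching newlines.
// Unlike \s, a literal space in the pattern does not match \n, \r or \t.
function collapseSpaces(text) {
  return text
    .replace(/ {2,}/g, ' ')    // two or more spaces become one
    .replace(/ *\n */g, '\n')  // assumption: also trim spaces hugging a newline
    .trim();
}

const sample = 'Niiiice\n  To:  Somebody';

const kept = collapseSpaces(sample);
const lost = sample.replace(/\s{2,}/g, ' ').trim();

console.log(JSON.stringify(kept)); // newline survives
console.log(JSON.stringify(lost)); // newline swallowed by \s{2,}
```

The difference between the two logged strings is the whole story of this thread: \s counts the newline as whitespace, a literal space does not.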

My thanks to all who replied; every suggestion offered something to try, which led to the clues that ultimately led to a solution.


Dubz · ‎10-21-2019

Did you try my suggested amendment above? It added line breaks for me....

function removeTags(string){
    return string.replace(/<br\/>/gm, '\n')      // <br/> becomes a newline
                .replace(/(<([^>]+)>)/ig, '')     // strip all remaining tags
                .replace(/&nbsp;/gi,' ')
                .replace(/&tab;/gi,' ')
                .replace(/&gt;/gi,'>')
                .replace(/&lt;/gi,'<')
                .replace(/&apos;/gi,"'")
                .replace(/&amp;/gi,'&')           // decode &amp; last so "&amp;lt;" is not decoded twice
                .replace(/\s{2,}/g, ' ')
                .trim();
}

James Bengel · ‎10-21-2019

Still testing with it. It appears to work when the one specific formulation of <br/> is used, but if I try to expand it to capture other permutations:

.replace(/<\s*\/?br\s*[\/]?>/gmi, '\n')

it fails.
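For what it's worth, the broader pattern can be checked in isolation, away from the rest of the chain. The list of variants below is an assumption about which permutations are meant. Note that the m flag only changes how ^ and $ behave, so it should make no difference to this pattern either way.

```javascript
// The expanded pattern from the post above, minus the m flag (which only
// affects the ^ and $ anchors, and this pattern uses neither).
const BR_PATTERN = /<\s*\/?br\s*[\/]?>/gi;

// Assumed set of permutations to try
const variants = ['<br>', '<br/>', '<br />', '</br>', '<BR>', '< br >'];

const converted = variants.map(
  (tag) => ('a' + tag + 'b').replace(BR_PATTERN, '\n')
);

// JSON.stringify makes the inserted newline visible as \n
variants.forEach((tag, i) => {
  console.log(tag.padEnd(8), JSON.stringify(converted[i]));
});
```

Every variant in that list comes out with a newline between a and b, which suggests that when the full chain "fails", the pattern itself is not the part to blame.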

Also apparently problematic is using the backtick (``) to define a multi-line literal. When I try that, even yours fails. That's supposed to be legit in JS, and I haven't gotten any complaints about it from the interpreter, so I assume it's a legal expression, but it's throwing a wrench in the works for this purpose.

As long as everything is in one line, it seems to work until I try to include the other variants of the tag, all of which are technically "legal".
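One plausible explanation for the backtick behavior, offered as a guess and not something confirmed in the thread: a multi-line template literal already contains real newline characters plus whatever indentation follows them, so any inserted '\n' ends up inside a run of whitespace that a later \s{2,} replacement collapses. The input string and the shortened chain below are assumptions for illustration.

```javascript
// A template literal keeps the line break and the indentation that follow it.
const html = `Niiiice<br/>
    <p>To: Somebody</p>`;

// Shows the embedded \n and the four spaces of indentation
console.log(JSON.stringify(html));

// Shortened, assumed version of the replacement chain
const out = html
  .replace(/<br\/>/g, '\n')  // adds a second newline next to the existing one
  .replace(/<[^>]+>/g, '')   // strip remaining tags
  .replace(/\s{2,}/g, ' ')   // the whole "\n\n    " run collapses to one space
  .trim();

console.log(JSON.stringify(out));
```

So the literal itself is perfectly legal; it just guarantees that the inserted newline lands next to other whitespace.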

James Bengel · ‎10-21-2019

Okay, as long as I keep it all on one line, the

.replace(/<\s*\/?br\s*[\/]?>/gim, '\n')

appears to work, but now I have the same problem with the </p> tag replacement, which is even simpler. Again, if I replace it with the literal string 'para' it will do that, but '\n' yields no joy.

as in

.replace(/<\/p>/igm, '\n')

But it's progress I suppose.

Ajay_Chavan · ‎10-21-2019

hi there,

Try the code below; I usually use it to deal with HTML.

Now remove all the HTML tags from the HTML body

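// Drop <style> and <script> blocks together with their contents
htmlcode = htmlcode.replace(/<style[\s\S]*?<\/style>/gi, '');
htmlcode = htmlcode.replace(/<script[\s\S]*?<\/script>/gi, '');
// Block-level closing tags and <br> become newlines; list items get a bullet
htmlcode = htmlcode.replace(/<\/div>/ig, '\n');
htmlcode = htmlcode.replace(/<\/li>/ig, '\n');
htmlcode = htmlcode.replace(/<li>/ig, '  *  ');
htmlcode = htmlcode.replace(/<\/ul>/ig, '\n');
htmlcode = htmlcode.replace(/<\/p>/ig, '\n');
htmlcode = htmlcode.replace(/<br\s*[\/]?>/gi, "\n");
// Strip whatever tags are left
htmlcode = htmlcode.replace(/<[^>]+>/ig, '');
// Collapse runs of spaces (not newlines) into a single space
htmlcode = htmlcode.replace(/ {2,}/g, ' ');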

If this resolves your query, please mark my comments as correct and helpful.

Regards,

Ajay Chavan


James Bengel · ‎10-21-2019

I saw this on your profile when you suggested it before. The problem I've been running into isn't with finding the tags -- that part works. It's been with inserting the new line in the output when the regex detects a <br> or one of its variations. And in that regard I'm already doing what you suggested. I suspect it's working as well as it's going to. And it seems to be working well enough for what it was needed for.