Why does Javascript hate me?

James Bengel · ‎10-15-2019

This is more of a general Javascript question (probably) than one specifically related to what happens in SN, but nothing I have found in an entire day's research has gotten me closer to an answer, so I'm wondering if anybody here has encountered (and conquered) this particular dragon.

What I am tasked with is figuring out how to remove embedded HTML tags from a field. And I'm almost there. In fact, removing tags is a relatively simple exercise (once one gets over the fear of regex), but there's a bit of wisdom I heard once that states that the stories of most personal tragedies begin with the words "I decided"... And thus it is in this story. "I decided" that it would be a good idea to replace line breaks and paragraphs with actual newlines. Which again seems like it would be a simple replacement, right? Something like this:

.replace(/<\s*\/?br\s*[\/]?>/gi, '\n')

Now I'll begin by telling you the regex works. I confirmed this by replacing the '\n' with 'line break' and with the input:

Niiiice<br/><div tabindex="-1" class="_rp_f1 _rp_g1" id="ItemHeader.ToContainer" role="heading"><span class="_rp_i1 ms-fwt-sb ms-font-color-black _rp_h1">To:</span> Somebody whose name is unimportant

that (plus some other replacements not shown here for brevity) results in the output

Niiiice line break To: Somebody whose name is unimportant

So it replaced the <br/> with the literal "line break", which tells me the search was successful. But returning to the original .replace() call, the output when I specify '\n' as the replacement string looks like:

Niiiice To: Somebody whose name is unimportant

No line break. Big frowny face.

Knowing that the script is finding the line break successfully, I'm forced to conclude that Javascript simply doesn't like outputting a '\n' as a newline. But wait, there's more. It also doesn't like any of these options -- all of which stubbornly yield the same result:

"\n",
'\r''\n',
'\r\n'
"\r\n",
'\r',
'\0x0a' (or 0d0a)
'\u000a' (or 000d 000a)
os.EOL (require os)

And given that this also happens in Node JS at the command prompt if I log output to the console, I'm starting to think this is just one more example of Javascript being Javascript, and wondering if it really matters that much if the breaks don't break. But every blog/forum/post/tutorial I run across that has an answer that sounds even remotely on point says that what I originally did should work. And yet. So before I tap out entirely I figured I'd put the question to the brethren here and see if anyone else has run up against this.

James Bengel · ‎10-21-2019

I see what did it now. It wasn't that the newlines weren't being inserted; it was that the last replacement before the trim was taking them back out again:

.replace(/\s{2,}/g, ' ');

Once the tags had been replaced by line breaks, the \s{2,} was seeing them as whitespace and pulling the line back up again when it collapsed the extra spaces into a single space. And since one tag wasn't followed by a space and the other was, there appeared to be some inexplicable discrepancy between the two replacements. But it was explicable once I broke the collection of .replace() calls into separate assignments and stuck a console.log between each one so I could see what the output from each call was. With that done, I could see that the tag DID initially convert as expected, and that somewhere below that it was "unconverted". At that point, it was just a matter of working back through the list of replacements to see which one preceded the "unconversion", and then it was obvious what had happened.
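For anyone who wants to repeat that kind of step-by-step check, here is a minimal sketch of the idea. The shortened input string and the step labels are made up for illustration; only the three patterns come from this thread. It applies each replacement separately and logs the intermediate result, using JSON.stringify so that a newline shows up as a visible \n rather than an actual line break in the log.

```javascript
// Hypothetical shortened input; the real field content is much longer.
var input = 'Niiiice<br/><div>To:</div> Somebody';

// Each step is [label, pattern, replacement], applied in order.
var steps = [
  ['br -> newline', /<\s*\/?br\s*[\/]?>/gi, '\n'],
  ['other tags -> space', /<[^>]*>/gi, ' '],
  ['collapse whitespace', /\s{2,}/g, ' ']
];

var text = input;
var trace = [];
steps.forEach(function (step) {
  text = text.replace(step[1], step[2]);
  trace.push(text);
  // JSON.stringify shows the newline as \n so it is easy to spot.
  console.log(step[0] + ': ' + JSON.stringify(text));
});
```

The first two lines of output still contain \n; the third does not, which points at the whitespace-collapsing step as the one that removes it.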

BUT, since the /\s{2,}/ is the final replacement before the trim, and every other tag that isn't converted to a discrete value is converted to a single space, all I had to do was replace the /\s{2,}/ with /  / (two literal spaces), et voila!
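A slightly different way to get the same effect, offered only as a sketch and not as what was actually used here, is to collapse runs of spaces and tabs with a character class that leaves newlines alone. The function name collapseSpaces is hypothetical, and the extra step that strips spaces next to a newline is an assumption about what the output should look like.

```javascript
// Hypothetical helper: collapse horizontal whitespace but keep newlines.
function collapseSpaces(text) {
  return text
    .replace(/[ \t]{2,}/g, ' ')        // runs of spaces/tabs become one space
    .replace(/[ \t]*\n[ \t]*/g, '\n')  // drop spaces touching a newline
    .trim();
}

var result = collapseSpaces('Niiiice\n To:  Somebody');
console.log(JSON.stringify(result)); // "Niiiice\nTo: Somebody"
```

Because [ \t] never matches \n, the line break inserted earlier in the chain survives this step.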

My thanks to all who replied; every suggestion offered something to try, which led to the clues that ultimately led to a solution.


Dubz · ‎10-18-2019

Hi James,

Try this out and let me know if it works; I tested it on my PDI and it seemed to do the trick.

var text = '`<div class="_rp_84"><div class="_rp_V"><div><div class="_rp_d1">Niiiice<br/><div tabindex="-1" class="_rp_f1 _rp_g1" id="ItemHeader.ToContainer" role="heading"><span class="_rp_i1 ms-fwt-sb ms-font-color-black _rp_h1">To:</span> Somebody whose name is unimportant, I just need more text to test with.<p class="some_para_class_name">Because I\'m not certain how this will work with html read in from a field.</p>&amp; I&apos;m even less sure if the &lt;escapes&gt; workMLA Supplement => Automation Test Issue => <span style="color:#333333 ;font-family:Helvetica, Arial, sans-serif;line-height:19.2000007629395px;background-color:#dedede ;">Add New Action<html><table>MLA Supplement => <td>Automation</td> Test Issue => Add new <span style="color: #3333">Action</span></table><html>`';

var newText = removeTags(text);

gs.log(newText);

function removeTags(string){
    return string.replace(/<br\/>/gm, '\n')
                .replace(/(<([^>]+)>)/ig, '')
                .replace(/&nbsp;/gi,' ')
                .replace(/&tab;/gi,' ')
                .replace(/&amp;/gi,'&')
                .replace(/&gt;/gi,'>')
                .replace(/&lt;/gi,'<')
                .replace(/&apos;/gi,"'")
                .replace(/\s{2,}/g, ' ')
                .trim();
  }

James Bengel · ‎10-21-2019

Turned out that the last replace (\s{2,}) was collapsing the newly split lines back into one if the tag was followed by whitespace. Which is why the break before <div tabindex="-1" stayed split and the one followed by & stayed split, but when a tag was followed by a space it got pulled back up into the previous line. Because regex treats \n as whitespace. And \s thus includes \n (and several other escapes, but that was the one at issue here).
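A quick way to see this for yourself (a small sketch with made-up sample strings) is to test \s against a newline directly and compare what the collapse does with and without a trailing space:

```javascript
// \s matches a newline, so "\n " counts as two whitespace characters.
var newlineIsWhitespace = /\s/.test('\n');
var pairMatches = /^\s{2,}$/.test('\n ');

// Newline followed by a space: collapsed into a single space.
var collapsed = 'a\n b'.replace(/\s{2,}/g, ' ');

// Newline on its own: only one whitespace character, so it is left alone.
var kept = 'a\nb'.replace(/\s{2,}/g, ' ');

console.log(newlineIsWhitespace, pairMatches); // true true
console.log(JSON.stringify(collapsed));        // "a b"
console.log(JSON.stringify(kept));             // "a\nb"
```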

Dubz · ‎10-22-2019

Ha, so it's not Javascript that hates you, it's regex! You know what they say, if you have a problem and add a regex, now you have two problems!

Ajay_Chavan · ‎10-18-2019

Hi buddy, will you please check out my profile? I have created an article with code to remove HTML tags; I'm sure it will help you.


James Bengel · ‎10-21-2019

You're doing essentially the same thing I did except in a slightly different order and with a slightly different structure. You've broken out each replace on a separate line, where I just strung mine together. Apart from that, we're using essentially the same replacements -- at least for the replacements that are causing me grief.

Yours (partial):

htmlcode = htmlcode.replace(/<\/p>/ig, '\n');
htmlcode = htmlcode.replace(/<br\s*[\/]?>/gi, "\n");

Mine:

function removeTags(string){
    return string.replace(/<\s*\/?br\s*[\/]?>/gi, '\n')
                .replace(/<\/p>/gi, '\n')
                .replace(/<[^>]*>/gi, ' ')
                .replace(/&nbsp;/gi,' ')
                .replace(/&tab;/gi,' ')
                .replace(/&amp;/gi,'&')
                .replace(/&gt;/gi,'>')
                .replace(/&lt;/gi,'<')
                .replace(/&apos;/gi,"'")
                .replace(/\s{2,}/g, ' ')
                .trim();
  }

The problem wasn't that the regex wasn't finding the tags, the problem was -- and is -- that the replacement string '\n' isn't being translated to a newline in the output. If you're actually getting the '\n' to insert a line break in the output, I would love to know how you got that to work, because I'm using the same replace and it is stubbornly refusing to do that for me.

What I have seems to be doing the job that's required for this particular need, but I can foresee other use cases where that line break issue will be a problem.