Why does Javascript hate me?

James Bengel · ‎10-15-2019

This is more of a general Javascript question (probably) than one specifically related to what happens in SN, but nothing I have found in an entire day's research has gotten me closer to an answer, so I'm wondering if anybody here has encountered (and conquered) this particular dragon.

What I am tasked with is figuring out how to remove embedded HTML tags from a field. And I'm almost there. In fact, removing tags is a relatively simple exercise (once one gets over the fear of regex), but there's a bit of wisdom I heard once that states that the stories of most personal tragedies begin with the words "I decided"... And thus it is in this story. "I decided" that it would be a good idea to replace line breaks and paragraphs with actual newlines. Which again seems like it would be a simple replacement, right? Something like this:

.replace(/<\s*\/?br\s*[\/]?>/gi, '\n')

Now I'll begin by telling you the regex works. I confirmed this by replacing the '\n' with 'line break' and with the input:

Niiiice<br/><div tabindex="-1" class="_rp_f1 _rp_g1" id="ItemHeader.ToContainer" role="heading"><span class="_rp_i1 ms-fwt-sb ms-font-color-black _rp_h1">To:</span> Somebody whose name is unimportant

that (plus some other replacements not shown here for brevity) results in the output

Niiiice line break To: Somebody whose name is unimportant

So it replaced the <br/> with the literal "line break", which tells me the search was successful. But returning to the original .replace() call, the output when I specify '\n' as the replacement string looks like:

Niiiice To: Somebody whose name is unimportant

No line break. Big frowny face.

Knowing that the script is finding the line break successfully, I'm forced to conclude that Javascript simply doesn't like outputting a '\n' as a newline. But wait, there's more. It also doesn't like any of these options -- all of which stubbornly yield the same result:

"\n",
'\r''\n',
'\r\n'
"\r\n",
'\r',
'\0x0a' (or 0d0a)
'\u000a' (or 000d 000a)
os.EOL (require os)
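
For reference, most of the spellings in this list denote the same characters, so swapping one for another would not be expected to change the result. A minimal sketch, runnable in Node, that compares them (the variable names are only illustrative and not part of the original script):

```javascript
const os = require('os');

// Single and double quotes produce identical strings in JavaScript.
const sameQuotes = '\n' === "\n";

// '\u000a' is the Unicode escape for the same line feed character.
const sameUnicode = '\u000a' === '\n';

// '\x0a' is the hex escape for line feed. '\0x0a' is something else:
// a NUL character followed by the three literal characters "x0a".
const hexEscape = '\x0a' === '\n';
const nulPlusText = '\0x0a';

// os.EOL is '\n' on POSIX systems and '\r\n' on Windows.
const eolIsNewline = os.EOL === '\n' || os.EOL === '\r\n';

console.log(sameQuotes, sameUnicode, hexEscape, nulPlusText.length, eolIsNewline);
// prints: true true true 4 true
```

Since '\r' and '\n' both count as whitespace to a regex \s, any of these would also be affected the same way by a later whitespace replacement in the chain.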

And given that this also happens in Node JS at the command prompt if I log output to the console, I'm starting to think this is just one more example of Javascript being Javascript, and wondering if it really matters that much if the breaks don't break. But every blog/forum/post/tutorial I run across that has an answer that sounds even remotely on point says that what I originally did should work. And yet. So before I tap out entirely I figured I'd put the question to the brethren here and see if anyone else has run up against this.

James Bengel · ‎10-21-2019

I see what did it now. It wasn't that the newlines weren't being inserted; it was the last replacement before the trim that was taking them back out again:

.replace(/\s{2,}/g, ' ');

Once the tags had been replaced by line breaks, the \s{2,} was seeing them as whitespace and pulling the line back up again when it collapsed the extra spaces into a single space. And since one tag wasn't followed by a space and the other was, there appeared to be some inexplicable discrepancy between the two replacements. But it was explicable once I broke the collection of .replace() calls into separate assignments and stuck a console.log between each one so I could see what the output from each call was. With that done, I could see that the <br/> DID initially convert as expected, and that somewhere below that it was "unconverted". At that point, it was just a matter of working back through the list of replacements to see which one preceded the "unconversion", and then it was obvious what had happened.
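
The approach of separating the chained calls and logging between them can be sketched roughly as follows. This is a cut-down illustration written for Node, not the original script: the steps array, the traceReplacements name, and the short sample input are all made up for the example, and only three of the replacements are included.

```javascript
// Each entry is one step of the chain: [label, pattern, replacement].
const steps = [
  ['br to newline', /<\s*\/?br\s*[\/]?>/gi, '\n'],
  ['strip remaining tags', /<[^>]*>/g, ' '],
  ['collapse whitespace', /\s{2,}/g, ' '],
];

function traceReplacements(input) {
  let text = input;
  for (const [label, pattern, replacement] of steps) {
    text = text.replace(pattern, replacement);
    // JSON.stringify prints a newline as the visible escape \n,
    // so it is easy to see whether one is present at each step.
    console.log(label + ': ' + JSON.stringify(text));
  }
  return text.trim();
}

const traced = traceReplacements('Niiiice<br/><div>To:</div> Somebody');
// br to newline: "Niiiice\n<div>To:</div> Somebody"
// strip remaining tags: "Niiiice\n To:  Somebody"
// collapse whitespace: "Niiiice To: Somebody"
```

Printing through JSON.stringify rather than logging the raw string makes the point at which the \n disappears visible in the log.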

BUT, since the /\s{2,}/ is the final replacement before the trim, and every other tag that isn't converted to a discrete value is converted to a single space, all I had to do was replace the /\s{2,}/ with /  / (two literal spaces), et voila!
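
A variation on the same idea, offered as a sketch rather than as the exact code that went into the final script: collapse runs of spaces and tabs only, so that newlines survive. The function name removeTagsKeepNewlines is hypothetical, and two things here go beyond the fix described above and are assumptions of this sketch: &amp; is decoded last, and spaces left hanging around a newline are tidied up.

```javascript
function removeTagsKeepNewlines(html) {
  return html
    .replace(/<\s*\/?br\s*[\/]?>/gi, '\n') // <br>, <br/>, <br /> become newlines
    .replace(/<\/p>/gi, '\n')              // end of paragraph becomes a newline
    .replace(/<[^>]*>/g, ' ')              // every other tag becomes a space
    .replace(/&nbsp;/gi, ' ')
    .replace(/&gt;/gi, '>')
    .replace(/&lt;/gi, '<')
    .replace(/&apos;/gi, "'")
    // Assumption of this sketch: decode &amp; last, so that "&amp;lt;"
    // ends up as the text "&lt;" instead of being decoded twice to "<".
    .replace(/&amp;/gi, '&')
    // Collapse spaces and tabs only; \s would also swallow the newlines.
    .replace(/[ \t]{2,}/g, ' ')
    // Optional tidy-up: drop spaces and tabs left around each newline.
    .replace(/[ \t]*\n[ \t]*/g, '\n')
    .trim();
}

console.log(JSON.stringify(removeTagsKeepNewlines('Niiiice<br/><div>To:</div> Somebody')));
// prints: "Niiiice\nTo: Somebody"
```

Replacing two literal spaces works for the same reason: the pattern no longer matches '\n', so the inserted line breaks are left alone.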

My thanks to all who replied; every suggestion offered something to try, which led to the clues that ultimately led to a solution.

Brian Lancaster · ‎10-15-2019

What type of field are you trying to remove the HTML from?

James Bengel · ‎10-15-2019

In this case, we're talking about acceptance_criteria (on rm_story), which is typed as HTML, but I've run up against this in transform maps too where I had to treat the input as text because that was the lowest common denominator.

I've heard mentions of an innerText(?) property that would remove the need for this, but I haven't a clue how you'd access it in a server side script. Seems like you'd need direct DOM access for something like that. But my Javascript is almost entirely self-taught (on the fly, as needed) so there's a vast gulf between what I know and what I probably need to know.

Brian Lancaster · ‎10-15-2019

\n only works for string fields. You cannot use it in an HTML field. An HTML field needs to have HTML in it to format properly. The system should automatically put the HTML in for the person entering the text and keep it hidden in the background.

James Bengel · ‎10-18-2019

I'm not trying to put the HTML in, I'm trying to take it out. The tags are in the field, that's the problem. The output field is a string -- and as such the HTML isn't being rendered, it's simply being displayed. In the output. Making it nigh unreadable, and thus useless to me.

This is a known issue, actually, and apparently one that has been "resolved" by telling people "Don't do that".

What happens is this:

Two of my team have been given the job of generating reports that list items that contain one or more fields that contain HTML. And in said reports, the raw HTML tags are displaying rather than being rendered. This happens in several places in the system, which is where all those links above come from. But the requirement is such that "Don't do that" isn't a workable solution for us, so the two who got tasked with making it work asked if I could come up with a function that would remove the tags from the output so as to not have a lot of markup on the report. Since they're working in portal pages, using various types of list/report widgets, we can fit my little function into the server script and filter out the tags before handing them on to the widget. And it works, as far as it goes, in that given the input:

<div class="_rp_84"><div class="_rp_V"><div><div class="_rp_d1">Niiiice<br/>

I get the output:

Niiiice

after running it through the filter. And that gets passed on through the data object to appear on the report as plain text without all the markup. Mission accomplished. Almost.

The only thing that doesn't work is trying to convert the <br> (or <br/> or <br /> or what have you) to an actual line break ('\n'). The regex part of the replace works fine -- I can successfully find all those variants of <br> and replace them with almost anything else except a newline.

This appears to be a case of Javascript being Javascript, because the same thing happens when I leave the confines of ServiceNow and try to do it with a simple Node JS variant at the command line.

var html = `<div class="_rp_84"><div class="_rp_V"><div><div class="_rp_d1">Niiiice<br/>
<div tabindex="-1" class="_rp_f1 _rp_g1" id="ItemHeader.ToContainer" role="heading">
<span class="_rp_i1 ms-fwt-sb ms-font-color-black _rp_h1">To:</span> Somebody whose name is unimportant, I just need more text to test with.
<p class="some_para_class_name">
Because I'm not certain how this will work with html read in from a field.</p>
&amp; I&apos;m even less sure if the &lt;escapes&gt; work
MLA Supplement => Automation Test Issue => 
<span style="color:#333333 ;font-family:Helvetica, Arial, sans-serif;line-height:19.2000007629395px;background-color:#dedede ;">Add New Action
<html>
<table>
MLA Supplement => <td>Automation</td> Test Issue => Add new <span style="color: #3333">Action</span>
</table>
<html>`

console.log(html);
console.log(removeTags(html));

function removeTags(string){
    return string.replace(/<\s*\/?br\s*[\/]?>/gi, '\n')
                .replace(/<\/p>/g, '\n')
                .replace(/<[^>]*>/g, ' ')
                .replace(/&nbsp;/gi,' ')
                .replace(/&tab;/gi,' ')
                .replace(/&amp;/gi,'&')
                .replace(/&gt;/gi,'>')
                .replace(/&lt;/gi,'<')
                .replace(/&apos;/gi,"'")
                .replace(/\s{2,}/g, ' ')
                .trim();
  }

Like so:

> node stripHTML.js
<div class="_rp_84"><div class="_rp_V"><div><div class="_rp_d1">Niiiice<br/>
<div tabindex="-1" class="_rp_f1 _rp_g1" id="ItemHeader.ToContainer" role="heading">
<span class="_rp_i1 ms-fwt-sb ms-font-color-black _rp_h1">To:</span> Somebody whose name is unimportant, I just need more text to test with.
<p class="some_para_class_name">
Because I'm not certain how this will work with html read in from a field.</p>
&amp; I&apos;m even less sure if the &lt;escapes&gt; work
MLA Supplement => Automation Test Issue =>
<span style="color:#333333 ;font-family:Helvetica, Arial, sans-serif;line-height:19.2000007629395px;background-color:#dedede ;">Add New Action
<html>
<table>
MLA Supplement => <td>Automation</td> Test Issue => Add new <span style="color: #3333">Action</span>
</table>
<html>

The (actual) resulting output looks like this:

Niiiice To: Somebody whose name is unimportant, I just need more text to test with. Because I'm not certain how this will work with html read in from a field. & I'm even less sure if the <escapes> work
MLA Supplement => Automation Test Issue => Add New Action MLA Supplement => Automation Test Issue => Add new Action

The first .replace() was intended to convert the <br/> following the string "Niiiice" into a newline, giving me the (intended) output:

Niiiice
To: Somebody whose name is unimportant, I just need more text to test with.
Because I'm not certain how this will work with html read in from a field. & I'm even less sure if the <escapes> work
MLA Supplement => Automation Test Issue => Add New Action MLA Supplement => Automation Test Issue => Add new Action

(There's a tag that should have triggered the same conversion after the "To:" line which also didn't work, but what they both have in common is that none of the various values I've tried to use to insert a newline at that location have done so.)

At this point, we're just living with the fact that paragraphs and line breaks are going to end up converted to spaces, since the rest of it works fine. The point of this post was to see if one of the more learned scripters had any ideas on how to make the last bit of it behave.