Predictive Intelligence - several questions about the way it works

julienschmi · 11-04-2024

Hi everyone,

I am used to writing Python scripts for NLP tasks. I am currently trying out Predictive Intelligence (PI), and I have several questions about how it works.

Preprocessing

How are the following normalization steps handled?

- Lowercasing

- Accented characters

- Special characters

- Stemming/Lemmatization

- When dealing with multiple input text fields (namely Short Description, Description, and Additional Informations), these fields sometimes contain exactly the same content. Is there a way to add a condition such as: if the aforementioned fields contain the same text, PI takes only one of them into account?

- Concerning stopwords, is it good practice to add named entities, such as names of people or locations?

- Now that Word Corpus has been removed for Similarity and Clustering, what is the point of having different embedding techniques (Universal Sentence Encoder for Similarity/Clustering and Doc2Vec/TF-IDF/GloVe for Classification)?
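
To make the normalization and duplicate-field questions concrete, here is a minimal sketch in plain Python (standard library only) of the kind of preprocessing meant above. The function names `normalize` and `merge_unique_fields` are hypothetical and purely illustrative; nothing here reflects how PI works internally, and stemming/lemmatization is left out because it would need an external library.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and replace special characters with spaces."""
    text = text.lower()
    # Decompose accented characters (e.g. "é" -> "e" + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def merge_unique_fields(fields: list[str]) -> str:
    """Join the normalized fields, skipping any that duplicate an earlier one."""
    seen = set()
    kept = []
    for value in fields:
        key = normalize(value)
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return " ".join(kept)


if __name__ == "__main__":
    # Hypothetical incident: Short Description and Description are identical
    # once normalized, so only one copy is kept.
    print(merge_unique_fields(["VPN down", "vpn down!", "User cannot connect"]))
```

With this approach, fields that differ only in case, accents, or punctuation count as duplicates; whether PI offers an equivalent condition is exactly the question.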

Machine Learning Pipelines

Can you confirm that the models used are still the following?

- Logistic Regression, Decision Trees, and Random Forests for Classification.

- k-Nearest Neighbors (k-NN) and Cosine Similarity for Similarity.

- k-Means, DBSCAN and HDBSCAN for Clustering.

- Linear Regression and Support Vector Regression (SVR) for Regression.
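
For reference, this is what I understand by k-NN with cosine similarity, as a minimal pure-Python sketch. It uses simple word counts in place of real embeddings (an assumption made only to keep the example self-contained), and the helper names `cosine_similarity` and `top_k_similar` are hypothetical, not PI functions.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)


def top_k_similar(query: str, corpus: list[str], k: int = 2) -> list[tuple[str, float]]:
    """Return the k corpus texts most similar to the query, best first."""
    q = Counter(query.lower().split())
    scored = [(doc, cosine_similarity(q, Counter(doc.lower().split()))) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical incident short descriptions.
    incidents = ["vpn connection fails", "printer out of paper", "cannot connect to vpn"]
    for text, score in top_k_similar("vpn fails", incidents):
        print(f"{score:.3f}  {text}")
```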

For Classification, I want to implement incident categorization, which means predicting both Category and Subcategory. Is it possible to predict both at the same time, or should I create two separate models?

For Similarity, I can only apply filters to the Table, not to the Test Table. How can I work around this? Should I create a Database View to use for the Test Table?

For Clustering, the default model is k-Means, but I can't find any field to specify the number of clusters. Is this determined automatically by PI and, if so, how?

Thank you in advance!

Regards,

Julien