Problem Statement

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify posts from the dating advice and relationship advice subreddits, so that we can determine which advertisements should populate each page. Because this is a classification problem, I'll use Logistic Regression and Bayes models. Misclassifications in this case would be fairly harmless, so I will use the accuracy score and set a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of the repo.

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new all_text column
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
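
The cleaning steps above can be sketched roughly as follows. This is only an illustration: the sample rows are made up, and the column names title, selftext, and subreddit are assumptions based on the description rather than the actual notebook code.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped posts; the column names
# (title, selftext, subreddit) are assumed from the description above.
posts = pd.DataFrame(
    {
        "title": ["First date tips", "Need advice", "Is this normal?"],
        "selftext": ["Where should we go?", None, "We argue a lot lately."],
        "subreddit": ["dating_advice", "dating_advice", "relationship_advice"],
    }
)

# Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into a single all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post, to compare the distributions across subreddits.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Average word counts per subreddit as a quick comparison.
summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
print(summary)
```

On the real data, a histogram of the two word-count columns per subreddit would show the distributions rather than just the means.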

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most frequently, I will be right 63.3% of the time.
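
A baseline of this kind is just the share of the most frequent class. A minimal sketch, using hypothetical label counts chosen only to illustrate the arithmetic (they are not the real dataset sizes, and which subreddit is the majority class is an assumption):

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 19 posts from one subreddit and 11 from the other.
labels = ["relationship_advice"] * 19 + ["dating_advice"] * 11

print(round(baseline_accuracy(labels), 3))  # 19 / 30 -> 0.633
```

Any model worth keeping has to beat this number on held-out data.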

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test 75 | cross val 74

Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraping; pretty bad score with high variance. Train 99%, test 72%

  • tried to decrease max features and the score got even worse
  • tried lemmatizer preprocessing instead and the test score went up to 74%

Just increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
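
As a rough sketch of this setup, the snippet below combines a stratified split with CountVectorizer(min_df=3, ngram_range=(1, 2)) and logistic regression in a scikit-learn pipeline. The toy corpus, the random_state, the test size, and the number of cross-validation folds are all assumptions made for illustration; the scores it prints will not match the ones reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); the real input is the all_text column.
dating = [
    "first date went well",
    "how to ask for a second date",
    "nervous before a first date",
    "dating app match stopped replying",
] * 5
relationship = [
    "my boyfriend and I keep arguing",
    "my girlfriend wants to move in",
    "married for years and feeling distant",
    "my boyfriend forgot our anniversary",
] * 5
X = dating + relationship
y = ["dating_advice"] * len(dating) + ["relationship_advice"] * len(relationship)

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# min_df=3 drops terms seen in fewer than 3 documents;
# ngram_range=(1, 2) counts both single words and bigrams.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(model, X_train, y_train, cv=3).mean()
print("train", train_score, "| test", test_score, "| cross val", cv_score)
```

Keeping the vectorizer inside the pipeline means each cross-validation fold builds its vocabulary only from its own training portion, which avoids leaking test vocabulary into the model.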

I think Tfidf worked the best to decrease my overfitting (a variance problem) because I customized the stop words to take out the ones that were too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them further to increase all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words too. Finally, setting max features to 5000 cut down my columns to about a quarter of what they were, to focus only on the most commonly used words of what was left.

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting, but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words make a difference.


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post came from.