By refactoring the content upload process on our core domestic and international sites, we proved that we can acquire content at scale through user engagement with our network of sites. With a growing library but only legacy content quality checks in place, we doubled down on investing in internal technology to validate that our library houses quality documents and is structured for easy navigation and discovery.
The following outlines our internal strategy for enforcing content quality at ingestion into our library, enabling structured organization and discovery.
Content Quality Signals
Our legacy checks on upload only validated basic file formats, document size, and viruses. We added the following checks as signals against which we could analyze content and isolate spam patterns. For instance, a URL in a file can indicate high quality if it appears as a citation, but low-quality spam if it is an external link placed by bot traffic using our site as a link farm.
- Repeated Characters
- Phone Numbers
- Conspicuous Words
With these checks in place we instituted a base logic layer that weights these signals against one another, forming a training set from which to isolate spam patterns.
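As a concrete illustration, the signal extraction and weighting steps might look like the sketch below. The regexes, word list, and weights are all hypothetical stand-ins, not our production values:

```python
import re

# Illustrative spam-signal extraction; patterns and thresholds are examples only.
REPEATED_CHARS = re.compile(r"(.)\1{4,}")                # 5+ identical characters in a row
PHONE_NUMBER = re.compile(r"\b(?:\+?\d[\s.-]?){7,15}\b")  # loose phone-number shape
CONSPICUOUS_WORDS = {"viagra", "casino", "free money"}    # example word list

def extract_signals(text: str) -> dict:
    """Return raw spam signals for one document."""
    lower = text.lower()
    return {
        "repeated_chars": bool(REPEATED_CHARS.search(text)),
        "phone_number": bool(PHONE_NUMBER.search(text)),
        "conspicuous_words": sum(w in lower for w in CONSPICUOUS_WORDS),
        "url_count": len(re.findall(r"https?://\S+", text)),
    }

def spam_score(signals: dict, weights: dict) -> float:
    """Weight the signals against each other into a single spam score."""
    return sum(weights[k] * float(signals[k]) for k in weights)
```

In practice the weights would come from the training set described above rather than being hand-tuned.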
Additional Content Checks
On top of these signals, we scoped the following content checks to optimize uploads across our network as well as for newly expanded allowed content types.
IP Check / Location
This filter was intended to isolate users who upload at massive volume as a means of populating our site with marketing spam and low-quality content. The volume check was integrated into the upload flow on a per-user-ID basis, separate from the spam checks; the content a high-volume user uploads must still be validated by the individual checks.
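A per-user volume check of this kind can be sketched as a sliding-window counter. The window size and upload limit below are hypothetical, not our production thresholds:

```python
from collections import defaultdict, deque

class UploadVolumeCheck:
    """Illustrative sliding-window upload-volume check, keyed by user ID."""

    def __init__(self, max_uploads: int = 20, window_seconds: int = 3600):
        self.max_uploads = max_uploads
        self.window = window_seconds
        self._events = defaultdict(deque)  # user_id -> recent upload timestamps

    def allow(self, user_id: str, now: float) -> bool:
        """Record an upload attempt; return False once the user exceeds
        the per-window limit. Content checks still apply either way."""
        events = self._events[user_id]
        while events and now - events[0] >= self.window:
            events.popleft()          # drop timestamps outside the window
        if len(events) >= self.max_uploads:
            return False
        events.append(now)
        return True
```

Passing timestamps in explicitly (rather than calling a clock inside) keeps the check easy to test.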
Content Length
With the expansion of allowed content types beyond legacy Word, text, and PDF documents to include spreadsheets, PowerPoint decks, and more, we refactored our content length signals to conditionally enforce minimum content requirements on a case-by-case basis.
Foreign Language Filter
The foreign language filter was placed on all sites and routes each upload to the corresponding network site we power based on language. We have sites that support Spanish, Portuguese, and French documents, but legacy technology caused any attempt to upload one to StudyMode to be rejected. With this logic updated we created a centralized document solution: we store all documents for our supported countries, then present them natively on the appropriate site.
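The routing step might look like the sketch below. A production system would use a real language detector (e.g. a library such as langdetect or CLD); the tiny stopword heuristic and the site map here are purely illustrative:

```python
# Illustrative language-based routing; stopword sets and site names are examples.
STOPWORDS = {
    "en": {"the", "and", "for", "that", "with", "this"},
    "es": {"el", "la", "los", "las", "que", "para", "una"},
    "pt": {"o", "os", "uma", "não", "para", "que", "em"},
    "fr": {"le", "les", "des", "une", "est", "que", "dans"},
}
SITE_BY_LANG = {
    "en": "studymode.example",
    "es": "es.studymode.example",
    "pt": "pt.studymode.example",
    "fr": "fr.studymode.example",
}

def route_upload(text: str) -> str:
    """Guess the document language from stopword overlap and return
    the network site the document should be stored on."""
    words = set(text.lower().split())
    lang = max(STOPWORDS, key=lambda code: len(words & STOPWORDS[code]))
    return SITE_BY_LANG[lang]
```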
Managing Duplicate Content
In reviewing our external duplicate checks, we needed to validate the following types of duplicates to avoid potential issues with Google's Penguin and Pirate algorithms for any content that may have been scraped from our site in the past.
- True Duplicates - any page that is 100% identical (in content) to another page; these pages differ only by URL.
- Near Duplicates - content that differs from another page (or pages) by a very small amount
- Cross Domain Duplicates - occurs when two websites share the same piece of content.
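One common way to detect the first two classes is word shingling plus Jaccard similarity; a true duplicate scores 1.0 and a near duplicate scores just below it. The shingle size and threshold below are illustrative, not our production settings:

```python
# Illustrative near-duplicate detection via word shingles and Jaccard similarity.
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def duplicate_class(doc_a: str, doc_b: str, near_threshold: float = 0.8) -> str:
    sim = jaccard(shingles(doc_a), shingles(doc_b))
    if sim == 1.0:
        return "true-duplicate"
    if sim >= near_threshold:
        return "near-duplicate"
    return "unique"
```

Cross-domain duplicates use the same comparison, just run against content fetched from other sites.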
After purging our content library of spam with the above automated checks, we added natural language processing integrations to our tech stack via the TextRazor and Alchemy API services. This allowed us to identify, via ranked relevance, the most important concepts covered in a document. With this analysis we can then expose those relevant concept tags for an essay to help users quickly understand what the document is about. The last key component of this integration was to classify each document into our larger topic taxonomy according to its most relevant topic. Our goals for this integration were:
- To be able to accurately identify the key topics covered in our research and writing documents and either replace or supplement (based on analysis) the current tags determined by TextRazor.
- Relevant categorization of documents based on primary concept tag.
- To generate tens of thousands of StudyMode topic pages, based on Alchemy (and potentially TextRazor) topic tags, which feature at least 10 StudyMode documents each.
- These topics will be mapped to our taxonomy, made indexable, and will help us to rank for popular head terms (e.g. Martin Luther King essays) relevant to our space.
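The classification step above can be sketched as picking the highest-relevance concept tag that maps into the taxonomy. The API responses are simplified here to (tag, relevance) pairs, and the tag-to-category map is a hypothetical example:

```python
# Illustrative mapping from NLP concept tags to our topic taxonomy.
TAXONOMY = {
    "Civil Rights Movement": "History",
    "Photosynthesis": "Science",
    "Supply and Demand": "Economics",
}

def classify(tags):
    """Given ranked (tag, relevance) pairs from an NLP service, return the
    taxonomy category of the most relevant tag that maps into the taxonomy."""
    for tag, _relevance in sorted(tags, key=lambda t: t[1], reverse=True):
        if tag in TAXONOMY:
            return TAXONOMY[tag]
    return None  # nothing mapped; fall back to human curation
```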
In order to gain the most insight and value from our document database, we set out to give each document a weighted quality score. This first version of the Quality Algorithm used a logistic model to predict whether a document is one we would accept as a showcase of our database.
Defining Input Variables
Outside of the spam filter checks, we integrated the signals below via open source libraries, such as an implementation of the Flesch-Kincaid Reading Ease score. These were the variables we built our initial model from:
- Readability score
- TextRazor synthesized category relevancy
- Popularity scoring
- Sentence complexity
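Combining these variables in a logistic model might look like the sketch below. The syllable counter is a crude approximation of Flesch Reading Ease, and the weights and intercept are hypothetical stand-ins for coefficients learned from the rated training set:

```python
import math
import re

def flesch_reading_ease(text: str) -> float:
    """Rough Flesch Reading Ease: higher means easier to read."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = text.split()
    # Approximate syllables as runs of vowels (crude but dependency-free).
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

def quality_score(text: str, category_relevance: float, popularity: float,
                  avg_sentence_complexity: float) -> float:
    """Logistic combination of the input variables -> P(accept) in (0, 1)."""
    weights = {"readability": 0.02, "relevance": 1.5,
               "popularity": 0.8, "complexity": -0.5}
    z = (weights["readability"] * flesch_reading_ease(text)
         + weights["relevance"] * category_relevance
         + weights["popularity"] * popularity
         + weights["complexity"] * avg_sentence_complexity
         - 2.0)  # hypothetical intercept
    return 1.0 / (1.0 + math.exp(-z))
```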
Defining a Validation and Training Set
In order to train the model, a training and validation set of documents was required. A rater was assigned to rate each document and indicate whether they would accept or reject it. This rater was validated as an expert at rating the documents, and the sample set included documents across the full quality spectrum, from very good to very bad.
The resulting score established a median across our entire document library based on the quality of our existing content. While it did not produce a familiar letter grade such as a traditional A/B/C content score, it gave us upper and lower bounds and an integration point into the internal human curation workflow we constructed.
Internal Curation Workflow
Since we can't rely on signals alone to fully trust that only quality content lives on our site, we integrated the technology validation above with a layer of human feedback that is fed back into the models. We spun up a team in India and blended technology with human-validated curation to reach a level of confidence in the quality of our database. The full end-to-end content quality check and human validation process was constructed from the workflow below:
Now that we have the above technology and validation workflow in place, what does it gain us in terms of site integration and enhancements? With our cleansed and structured database, we can now give users high-quality topic pages, a restructured categorization hierarchy, and an easier content discovery method.
Topic Landing Pages
Here we create tag pages based on document relevancy data from Alchemy and a total-document threshold: we didn't want to surface a topic for a document unless the topic met a minimum document count, to limit pages with thin content. This creates a concept graph of related topics that lives on top of the document library.
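The threshold rule can be sketched as a simple grouping pass, using the 10-document minimum mentioned earlier (the function and data shapes here are illustrative):

```python
from collections import defaultdict

def build_topic_pages(doc_tags, min_docs=10):
    """Map each topic tag that reaches the minimum document count to the
    list of documents carrying it; tags below the threshold get no page."""
    by_tag = defaultdict(list)
    for doc_id, tags in doc_tags.items():
        for tag in tags:
            by_tag[tag].append(doc_id)
    return {tag: docs for tag, docs in by_tag.items() if len(docs) >= min_docs}
```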
Topic pages were then integrated into a refactored taxonomy tree in which we replaced our legacy three-tiered category taxonomy with one tuned to correlate with search demand and to allow the new topic pages to be nested within the hierarchical structure:
- Essays Home
- Category 1
- Category 2
Enhancing Document Discovery
The last user-facing gain of this curation and content structuring update was giving users advanced content filtering tools to drill down to hyper-specific content without needing our legacy search. Users can string together hyper-focused topical references such as Martin Luther King Jr. + Rosa Parks + NAACP, and receive validation of each document's relevancy to their demand.
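Chaining tags like this amounts to a set intersection over a tag-to-documents index; a minimal sketch, with hypothetical data shapes:

```python
def filter_documents(tag_index, selected):
    """Return IDs of documents that carry every selected tag.
    tag_index maps tag -> set of document IDs."""
    if not selected:
        return set().union(*tag_index.values()) if tag_index else set()
    result = set(tag_index.get(selected[0], set()))
    for tag in selected[1:]:
        result &= tag_index.get(tag, set())  # each added tag narrows the results
    return result
```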
Overall, this internal content brain continues to be developed, but it has created not only user value but also internal IP and an investable technology platform.