Sentiment analysis is the classification of consumer text — reviews, social posts, comments — into positive, negative, or neutral labels, optionally with intensity scores. It is the methodological core of Consumer Insights and the engine behind sentiment-driven Social Listening.
It is also the most over-claimed and under-verified part of most market intelligence stacks. Headline accuracy figures advertised by sentiment vendors are typically measured on broad English benchmarks; production accuracy on a specific category in a specific language can be twenty to thirty percentage points lower. Practitioners who treat sentiment scores as ground truth without verifying category-level accuracy ship decisions on a foundation they have not inspected.
## How sentiment classifications are produced
Modern sentiment pipelines run in stages:
1. Pre-processing — language detection, tokenisation, handling of emoji, hashtags, and platform-specific markup
2. Classification — assigning a polarity label (positive / negative / neutral) and optionally an intensity score
3. Attribute-level decomposition — moving from "this review is negative" to "this review is negative about packaging"
4. Aggregation — rolling up individual classifications into category-, brand-, or attribute-level distributions
Steps 1 and 2 are commodity capabilities — most providers do them at similar baseline quality. Steps 3 and 4 are where intelligence-grade datasets separate from raw extract outputs. A pipeline that produces overall sentiment but not attribute-level decomposition cannot answer the questions practitioners actually need to answer.
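To make the four stages concrete, here is a minimal end-to-end sketch in Python. The lexicons and attribute keywords are placeholder assumptions for illustration only; in a real pipeline, stages 2 and 3 would be fine-tuned models rather than keyword matching.

```python
from collections import defaultdict

# Placeholder lexicons: illustrative only, not a production vocabulary.
POSITIVE = {"amazing", "great", "love"}
NEGATIVE = {"cheap", "awful", "leaks"}
ATTRIBUTE_TERMS = {"smell": "scent", "scent": "scent",
                   "texture": "texture", "packaging": "packaging"}

def preprocess(text: str) -> list[list[str]]:
    """Stage 1: normalise and split into clauses of tokens (real pipelines
    also handle language detection, emoji, and platform markup)."""
    clauses = text.lower().split(" but ")
    return [c.strip(" .!").split() for c in clauses if c.strip(" .!")]

def polarity(tokens: list[str]) -> str:
    """Stage 2: clause-level polarity from the placeholder lexicon."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def decompose(text: str) -> dict[str, str]:
    """Stage 3: attach each clause's polarity to the attributes it mentions."""
    result = {}
    for clause in preprocess(text):
        label = polarity(clause)
        for token in clause:
            if token in ATTRIBUTE_TERMS:
                result[ATTRIBUTE_TERMS[token]] = label
    return result

def aggregate(reviews: list[str]) -> dict[str, dict[str, int]]:
    """Stage 4: roll clause-level labels up into attribute distributions."""
    counts: defaultdict = defaultdict(lambda: defaultdict(int))
    for review in reviews:
        for attribute, label in decompose(review).items():
            counts[attribute][label] += 1
    return {a: dict(d) for a, d in counts.items()}

print(aggregate(["The smell is amazing but the texture feels cheap."]))
# {'scent': {'positive': 1}, 'texture': {'negative': 1}}
```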
## Polarity versus intensity
A common reporting mistake is treating polarity (positive / negative / neutral) as the only signal. Intensity matters more for several use cases:
- Early-warning detection. A small share of very angry reviews on a specific attribute often predicts reformulation pressure better than a larger share of mildly negative reviews.
- Brand health. Neutral-leaning sentiment on a category leader is a stronger competitive signal than evenly mixed positive/negative on a challenger.
- Comparative benchmarking. Two products with the same percent-positive can have very different intensity profiles; the one whose positives stay strongly positive across the distribution signals more durable preference than the one with a long tail of mild positives.
Datasets that surface only polarity hide the intensity dimension. Buyers should verify that the underlying scores are exposed, not just the labels.
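A toy comparison shows why. Both products below have the same percent-positive but very different intensity profiles; the scores are hypothetical, on an assumed -1.0 to +1.0 scale:

```python
# Hypothetical intensity scores on a -1.0 to +1.0 scale.
product_a = [0.90, 0.85, 0.80, 0.90, -0.10, -0.20]  # strong positives
product_b = [0.15, 0.20, 0.10, 0.27, -0.10, -0.20]  # long tail of mild positives

def percent_positive(scores: list[float]) -> float:
    return sum(s > 0 for s in scores) / len(scores)

def mean_positive_intensity(scores: list[float]) -> float:
    positives = [s for s in scores if s > 0]
    return sum(positives) / len(positives)

for name, scores in [("A", product_a), ("B", product_b)]:
    print(f"Product {name}: {percent_positive(scores):.0%} positive, "
          f"mean positive intensity {mean_positive_intensity(scores):.2f}")
# Product A: 67% positive, mean positive intensity 0.86
# Product B: 67% positive, mean positive intensity 0.18
```

A polarity-only report would call these two products identical; the intensity distribution is what separates them.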
## Where sentiment analysis fails
Two failure modes recur regardless of vendor:
Category vocabulary gaps. Generic models miss category-specific terminology. A skincare review that mentions a specific ingredient as a positive ("烟酰胺含量足", roughly "the niacinamide content is generous") may be classified as neutral because the model has no signal that the ingredient name carries valence in the category. The fix is category-specific fine-tuning, ideally with analyst-labelled data from the same platform mix the model will be applied to.
Cultural framing. Sarcasm, indirect criticism, and culturally modulated praise routinely flip sentiment away from what the literal words suggest. Chinese review text in particular often uses oblique framing in which a five-star rating accompanies what reads as mild criticism in the body. English-trained models miss this systematically.
Both failure modes share a common diagnostic: the neutral fraction on a held-out sample. A category-appropriate model classifies under twenty percent of reviews as neutral; a poorly tuned model dumps thirty to sixty percent into neutral because it cannot read the valence. The neutral fraction is the quickest single check for whether a sentiment dataset is fit for purpose in a given category.
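The check is cheap to run once a held-out sample has been scored. A minimal sketch, using the thresholds quoted above:

```python
def neutral_fraction(labels: list[str]) -> float:
    """Share of model labels on a held-out sample that are neutral."""
    return labels.count("neutral") / len(labels)

def neutral_check(labels: list[str]) -> str:
    frac = neutral_fraction(labels)
    if frac < 0.20:
        return f"{frac:.0%} neutral: within the fit-for-purpose range"
    if frac >= 0.30:
        return f"{frac:.0%} neutral: model likely cannot read category valence"
    return f"{frac:.0%} neutral: borderline, inspect a labelled sample"

# Hypothetical labels from a held-out sample of 100 reviews:
sample = ["positive"] * 40 + ["negative"] * 15 + ["neutral"] * 45
print(neutral_check(sample))
# 45% neutral: model likely cannot read category valence
```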
## Sentiment vs theme detection
Sentiment is valence at the sentence or document level. Theme detection identifies what the text is about. The two are different operations and require different models.
A review that says "the smell is amazing but the texture feels cheap" has mixed sentiment overall, positive sentiment on the smell attribute, and negative sentiment on the texture attribute. Producing this decomposition requires sentiment + attribute extraction operating jointly — not sentiment alone.
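One hypothetical record shape for that joint output (an illustration, not any vendor's schema):

```python
from dataclasses import dataclass, field

@dataclass
class ReviewSentiment:
    text: str
    overall: str                                  # document-level polarity
    by_attribute: dict[str, str] = field(default_factory=dict)

record = ReviewSentiment(
    text="the smell is amazing but the texture feels cheap",
    overall="mixed",
    by_attribute={"scent": "positive", "texture": "negative"},
)
# Theme detection supplies the keys ("scent", "texture"); sentiment
# classification supplies the values. Neither model alone produces this.
```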
Vendors that report only document-level sentiment cannot deliver attribute-level insights. Vendors that report attribute-level sentiment but cannot expose the underlying source text cannot be verified.
## What makes a sentiment dataset trustworthy
Five properties separate a defensible sentiment dataset from a marketing artefact:
1. Category-specific accuracy is reported, not just overall accuracy
2. Validation set is human-labelled by category-familiar analysts, not crowdworkers
3. Language mix of the validation set matches the production data
4. Neutral fraction on a held-out sample is under twenty percent
5. Source posts are accessible for analyst spot-checking and audit
Property 5 is the audit floor. A dataset that does not let an analyst drill from "X% negative on packaging" back to the actual reviews making that classification cannot be verified — and a dataset that cannot be verified should not be the basis for a million-dollar reformulation decision.
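In practice, the audit floor means every aggregate figure keeps a path back to its source posts. A minimal sketch of that drill-down, with hypothetical field names and URLs:

```python
# Hypothetical classified records; an auditable dataset keeps the source reference.
records = [
    {"attribute": "packaging", "label": "negative",
     "source_url": "https://example.com/review/101", "text": "box arrived crushed"},
    {"attribute": "packaging", "label": "positive",
     "source_url": "https://example.com/review/102", "text": "gift-ready box, no damage"},
]

def drill_down(records: list[dict], attribute: str, label: str) -> list[dict]:
    """From 'X% negative on packaging' back to the reviews behind the figure."""
    return [r for r in records if r["attribute"] == attribute and r["label"] == label]

for r in drill_down(records, "packaging", "negative"):
    print(r["source_url"], "->", r["text"])
```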
## Where to look next
For the broader insight discipline that sentiment feeds, see Consumer Insights. For the verification practices that distinguish credible datasets from extract dumps, see Sample Data Verification. For social-platform-specific applications, see Social Listening.
## Common questions
### Why does sentiment analysis fail in some product categories?
Generic sentiment models trained on broad consumer text produce lower accuracy in two situations: niche category vocabulary the model was not trained on (specific Chinese skincare ingredient terminology, technical durables jargon, regulated-industry language), and culturally specific framing where the literal words and the actual sentiment diverge, such as sarcasm, indirect criticism, or culturally modulated praise. The fix is category-specific tuning plus access to original posts for analyst spot-checking. A model that scores well on average can still produce systematically wrong results in the categories that matter to a buyer.
Why is "neutral" the most important sentiment label?#
Positive and negative are easy to interpret; the neutral bucket is where the methodology shows. A poorly tuned model classifies most reviews as neutral by default — which makes positive/negative shifts unreadable. A well-tuned model uses neutral sparingly and only when the text genuinely lacks valence. The neutral fraction is a quick diagnostic: under twenty percent suggests the model is confident; over fifty percent suggests it is hedging or undertrained on the category.
### What is the difference between sentiment polarity and sentiment intensity?
Polarity is the direction (positive, negative, neutral). Intensity is the magnitude — how strongly positive or negative. Intensity matters more than polarity for early-warning use cases: a small share of very angry reviews on a specific attribute often predicts reformulation pressure better than a large share of mildly negative ones. Datasets that report only polarity miss the intensity signal that makes the data actionable.
### How do you validate a sentiment classifier?
A defensible validation set has three properties: it is human-labelled by analysts familiar with the category (not crowdworkers labelling for cents), it covers the actual platform mix and language mix the model will be applied to (not just English benchmark sets), and it is held out from training. Reported accuracy on a single English-language benchmark is meaningless if the production data is bilingual reviews from a Chinese marketplace. Buyers should ask for category-specific accuracy figures, not headline numbers.
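Computing category-level accuracy from such a set is trivial; the hard part is insisting on the labelled data. A sketch over hypothetical (category, human label, model label) triples:

```python
from collections import defaultdict

# Hypothetical held-out triples: (category, human_label, model_label).
validation = [
    ("skincare", "positive", "neutral"),
    ("skincare", "positive", "positive"),
    ("skincare", "negative", "negative"),
    ("appliances", "negative", "negative"),
    ("appliances", "positive", "positive"),
]

totals: defaultdict = defaultdict(int)
correct: defaultdict = defaultdict(int)
for category, human, model in validation:
    totals[category] += 1
    correct[category] += (human == model)

for category in totals:
    print(f"{category}: {correct[category] / totals[category]:.0%} accuracy "
          f"(n={totals[category]})")
# skincare: 67% accuracy (n=3)
# appliances: 100% accuracy (n=2)
```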