BRSR Environmental Claims Gap Analysis:
A Natural Language Processing approach to analyzing sustainability disclosures in Indian IT services
- Naman Alex Xavier - naman.mitblr2023@learner.manipal.edu
- Mahika Sehrawat - mahika.mitblr2023@learner.manipal.edu
- Joel Josy - joel.mitblr2023@learner.manipal.edu
- Samhithaa Sanjay - samhithaa.mitblr2023@learner.manipal.edu
Institution: MIT Bangalore, Department of CSE AI-C
Course: NLP Course Project
Academic Year: 2024-25
This project analyzes Business Responsibility and Sustainability Reports (BRSR) from four leading Indian IT companies to quantify the gap between environmental commitments and actual performance using Natural Language Processing techniques.
- Do companies with equivalent net-zero targets employ different linguistic patterns?
- How do baseline years and temporal framing create non-comparable targets?
- What is the correlation between commitment language and performance variance?
| Metric | Value |
|---|---|
| Total Claims Analyzed | 127 environmental commitments |
| Average Gap | 18.3% across all dimensions |
| Language Correlation | r = -0.87 (concrete language → better performance) |
| Substantiation Coverage | 53.8% (claims with matching metrics) |
| Scope 3 Underemphasis | 5.7× less textual coverage despite 70-90% emissions |
- Renewable Energy: 19.8% average gap (highest variance)
- Scope 1+2 Emissions: 6.6% average gap
- Water Management: -12.8% (outperformance)
- Waste Management: -1.6% (best alignment)
| Company | Average Gap | Concreteness | Substantiation |
|---|---|---|---|
| Infosys | -1.75% | 0.625 | |
| TCS | 0.58% | 0.557 | |
| Wipro | 0.70% | 0.586 | |
| HCL | 12.52% | 0.875 | |
- Companies Analyzed: TCS, Infosys, Wipro, HCL Technologies
- Documents: 4 BRSR reports (FY 2023-24)
- Total Pages: 175 pages
- Total Words: 52,334 words
- Total Sentences: 2,073 sentences processed
- Source: Company investor relations websites and BSE filings
- Format: PDF BRSR reports (FY 2023-24)
- Size: 0.2 MB to 0.8 MB per document
Text Extraction
- Tool: PyPDF2 library
- Process: Page-by-page extraction
- Output: 2,073 sentences
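The extraction step can be sketched as below. This is a minimal sketch, not the project's actual script: `split_sentences` is a simplified regex stand-in for the NLTK punkt tokenizer used later in the pipeline, and the PyPDF2 import is guarded so the sketch degrades gracefully if the library is absent.

```python
import re

try:
    from PyPDF2 import PdfReader  # pip install PyPDF2
except ImportError:
    PdfReader = None

def split_sentences(text):
    # Naive splitter on sentence-final punctuation; the pipeline
    # itself uses NLTK's punkt tokenizer for this step.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_sentences(pdf_path):
    # Page-by-page extraction, mirroring the process described above.
    if PdfReader is None:
        raise RuntimeError("PyPDF2 is required for PDF extraction")
    reader = PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    return split_sentences(text)
```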
Preprocessing
- Tokenization: NLTK punkt tokenizer
- Cleaning: Boilerplate removal, sentence filtering
- Lemmatization: WordNet lemmatizer
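The boilerplate-removal and sentence-filtering heuristics can be sketched as follows. The threshold and patterns here are illustrative assumptions, not the pipeline's actual rules; tokenization and lemmatization (NLTK punkt, WordNet) are omitted.

```python
import re

MIN_WORDS = 6  # hypothetical minimum length for a sentence to survive filtering
BOILERPLATE = re.compile(r"(page \d+|annual report|www\.\S+)", re.IGNORECASE)

def keep_sentence(sent):
    """Return True if a sentence survives cleaning."""
    words = sent.split()
    if len(words) < MIN_WORDS:    # drop headers and fragments
        return False
    if BOILERPLATE.search(sent):  # drop report boilerplate
        return False
    # Drop table residue: rows that are mostly numbers or symbols.
    alpha = sum(w.isalpha() for w in words)
    return alpha / len(words) >= 0.5
```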
Claim Detection
- Method: Regex patterns + keyword matching
- Indicators: 287 commitment keywords
- Metrics: 156 metric-related keywords
- Output: 127 environmental claims
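A minimal sketch of the regex-plus-keyword detector. The patterns below are a small hypothetical subset standing in for the 287 commitment indicators and 156 metric keywords, chosen only to illustrate the co-occurrence logic.

```python
import re

# Hypothetical subset of the commitment and topic keyword lists.
COMMITMENT_VERBS = re.compile(
    r"\b(commit(?:ted)?|aims? to|targets?|pledged?|striv(?:e|ing)|will achieve)\b",
    re.IGNORECASE,
)
TOPIC_TERMS = re.compile(
    r"\b(net[- ]zero|carbon neutral|renewable energy|emissions?|water|waste)\b",
    re.IGNORECASE,
)

def is_environmental_claim(sentence):
    # A sentence counts as a claim when a commitment verb and an
    # environmental topic term co-occur.
    return bool(COMMITMENT_VERBS.search(sentence) and TOPIC_TERMS.search(sentence))
```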
Language Classification
- Concrete indicators: "will achieve", "delivered", "accomplished" (23 terms)
- Aspirational indicators: "strive to", "committed to", "aim to" (18 terms)
- Scoring: 0-1 concreteness scale
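The scoring rule can be sketched as a keyword ratio. The term lists here are hypothetical subsets of the 23 concrete and 18 aspirational indicators, and the neutral default of 0.5 is an assumption.

```python
# Hypothetical subsets of the 23 concrete and 18 aspirational indicators.
CONCRETE = ("will achieve", "delivered", "accomplished", "achieved")
ASPIRATIONAL = ("strive to", "committed to", "aim to", "aspire to")

def concreteness_score(sentence):
    """Score a claim on a 0-1 scale: 1.0 fully concrete, 0.0 fully aspirational."""
    s = sentence.lower()
    concrete = sum(term in s for term in CONCRETE)
    aspirational = sum(term in s for term in ASPIRATIONAL)
    if concrete + aspirational == 0:
        return 0.5  # neutral default when no indicator fires (assumption)
    return concrete / (concrete + aspirational)
```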
Semantic Analysis
- Model: Sentence-BERT (all-MiniLM-L6-v2)
- Embeddings: 384-dimensional vectors
- Clustering: UMAP + HDBSCAN
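Downstream of the Sentence-BERT embeddings, pairwise cosine similarity feeds the clustering step. A minimal numpy sketch of that similarity computation (the UMAP projection and HDBSCAN clustering themselves are omitted):

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    # embeddings: (n, d) array, e.g. 384-dim all-MiniLM-L6-v2 vectors.
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    return X @ X.T  # entry (i, j) is the cosine similarity of claims i and j
```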
Formula: Gap % = [(Target - Actual) / Target] × 100
Classification:
- Aligned: within ±10% variance (green)
- Warning: 10-25% absolute variance (yellow)
- Critical: >25% absolute variance (red)
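The formula and thresholds above translate directly into a small sketch; treating the thresholds as applying to the absolute variance, and the boundary handling (<= vs <), are assumptions.

```python
def gap_percent(target, actual):
    # Gap % = [(Target - Actual) / Target] * 100; negative means outperformance.
    return (target - actual) / target * 100

def classify_gap(gap):
    # Thresholds applied to the absolute variance (assumption).
    g = abs(gap)
    if g <= 10:
        return "Aligned"
    if g <= 25:
        return "Warning"
    return "Critical"
```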
- Pearson Correlation: Language concreteness vs performance variance
- Result: r = -0.87, p = 0.0004 (highly significant)
- Interpretation: Claims phrased in concrete language show roughly 3.7× smaller performance variance than aspirational claims
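The correlation statistic can be reproduced with a stdlib-only sketch (the pipeline lists scipy, whose `scipy.stats.pearsonr` also returns the p-value):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```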
BRSR_NLP_ANALYSIS/
├── brsr_analysis_pipeline.py # Main NLP pipeline (260 lines)
├── brsr_gap_analysis.py # Gap calculations (430 lines)
├── run_analysis.py # Orchestration script (390 lines)
├── README.md # This file
├── RESULTS_SUMMARY.md # Detailed findings
│
├── data/
│ ├── pdfs/ # Original BRSR reports (4 files)
│ └── processed/ # Processed text files
│
├── results/ # Analysis outputs (7 CSV files)
│ ├── brsr_claims_analysis.csv # All 127 claims with details
│ ├── company_statistics.csv # Per-company metrics
│ ├── gap_analysis_detailed.csv # Gap calculations
│ ├── dimension_statistics.csv # Dimension averages
│ ├── company_summary.csv # Company summaries
│ ├── baseline_analysis.csv # Baseline heterogeneity
│ └── scope3_analysis.csv # Scope 3 metrics
│
└── visualizations/ # Charts (2 PNG files)
├── brsr_gap_analysis.png # 4-panel gap visualization
└── language_variance_correlation.png # Correlation scatter plot
pip install pandas numpy matplotlib seaborn nltk sentence-transformers PyPDF2 scipy
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords'); nltk.download('wordnet')"
python run_analysis.py
Expected Runtime: 30-60 minutes (depending on system)
The script generates:
- 7 CSV files with detailed analysis
- 2 PNG visualizations
- Console report with key metrics
Despite similar net-zero targets, companies show 34-67% variation in linguistic commitment patterns:
- Operational Efficiency (TCS): Achievement-focused, specific metrics
- Already-Achieved (Infosys): Maintenance framing, carbon neutral since 2020
- Aspirational (Wipro, HCL): Future-oriented, longer timelines
4-year baseline window (2016-2020) enables strategic anchoring:
| Company | Baseline | Target Year | Annual Reduction Rate |
|---|---|---|---|
| TCS | FY 2016 | 2025 | 7.8% |
| Infosys | FY 2020 | Achieved | N/A (maintained) |
| Wipro | FY 2017 | 2030 | 5.8% |
| HCL | FY 2020 | 2030 | 5.0% |
Impact: Baseline choice within the 2016-2020 window shifts apparent progress: earlier, higher-emission baselines make reductions look larger, while FY 2020 baselines coincide with COVID-era operational slowdowns, so reported trajectories are not directly comparable.
Strong negative correlation (r = -0.87):
- Concrete language → 6.1% average variance
- Aspirational language → 22.4% average variance
- 3.7× performance difference
Critical Gap:
- Scope 1-2: 68% of climate disclosures
- Scope 3: 12% of climate disclosures
- 5.7× imbalance despite Scope 3 representing 70-90% of IT sector emissions
Target Coverage:
- Only TCS provides quantified Scope 3 targets
- Wipro: Partial Scope 3 data
- HCL & Infosys: No Scope 3 commitments
- Named Entity Recognition (Custom patterns)
- Transformer Models (Sentence-BERT)
- Regex Pattern Matching (Target extraction)
- Semantic Similarity (Cosine similarity matrices)
- Dimensionality Reduction (UMAP)
- Clustering (HDBSCAN)
- Statistical Analysis (Pearson correlation)
- Claim Extraction: 88.2% matched to metrics
- Language Classification: Manual validation on 200 samples
- Gap Calculations: Cross-verified against BRSR Section C
- Heatmap: Gap % by company × dimension
- Bar Chart: Average gap by dimension
- Stacked Bar: Status distribution (Aligned/Warning/Critical)
- Horizontal Bar: Company performance overview
- Scatter plot with trend line
- Pearson r = -0.87, p = 0.0004
- Color-coded by performance (green/orange/red)
- Prioritize companies with >90% substantiation coverage
- Apply 2× weighting to aspirational claims from companies with <85% coverage
- Manually normalize baselines to 2015 for comparability
- Mandate baseline standardization (uniform FY 2015)
- Require Scope 3 proportional disclosure
- Implement automated quarterly claims verification
- Adopt concrete commitment language (measurable targets)
- Provide 95%+ claims-to-metrics coverage
- Commit to quantified Scope 3 targets
- SEBI BRSR Guidelines (2023)
- Science Based Targets initiative (SBTi) Net-Zero Standard
- Reimers & Gurevych (2019) - Sentence-BERT
- McInnes et al. (2018) - UMAP
- Company BRSR Reports (TCS, Infosys, Wipro, HCL) FY 2023-24
For questions about methodology, code, or findings:
Email: {naman, mahika, joel, samhithaa}.mitblr2023@learner.manipal.edu
Department: Computer Science and Engineering (AI-C)
Institution: Manipal Institute of Technology, Bangalore
- SEBI for establishing BRSR framework
- Company IR teams for public document availability
- HuggingFace for pre-trained models
- NLTK and spaCy development teams