We live in a data-driven world, but in India there are figures and then there are facts

Against the backdrop of India's general election, for which voting ends today, the Mint newspaper published a story earlier this month that caused great concern among politicians, analysts and anyone with a passing interest in maintaining a functioning economy. Data used to measure the country's performance is riddled with inaccuracies and, apparently, based on companies that do not even exist.

However, it is not the first time something like this has happened. In September 2013, The Hindu published an article on national crime statistics that sent shockwaves through sections of Indian academia and policy punditry. The writer, Rukmini S, revealed that the National Crime Records Bureau (NCRB) had been systematically under-reporting millions of crimes in official documentation for years. Those same documents had been widely used for analysis and policy formulation.

These errors were a direct result of a peculiar problem with the way data was collected in cases of crime that involved more than one type of offence. If a crime involved both, say, robbery and murder, the NCRB’s data-gathering method recorded only what the bureau referred to as the “primary offence”. In other words, in the case of a multi-offence crime, only the offence that attracted the severest punishment was included in the NCRB’s annual reports. This meant that many millions of secondary offences simply went missing from official reports on Indian crime.

It was a staggering revelation, and not so much because of the methodology itself. For all its faults, the "primary offence" model is a common form of crime reporting in many other countries. The bigger problem was that the official reports made no mention of this mode of data collection. It was only in 2014, months after The Hindu broke the initial story, that the NCRB began to include appropriate disclaimers in its work.

Shortly afterwards, in December 2013, the Times of India reported that the 2011 census may have dramatically undercounted the number of disabled people in the country, especially in rural areas. Despite a particular focus on capturing granular disability data in the 2011 Indian census, several factors explain why India reports far lower rates of disability than many other countries in the region. These include the deep-rooted stigma associated with discussing disability in families, problems defining what disability actually meant for the purposes of the survey, and insufficient training of census staff. Nandita Saikia and Mukesh Parmar, researchers from Jawaharlal Nehru University in Delhi, have since found that disability among India's elderly population, reported at just five per cent in the census, is probably about four times as prevalent.

More recently, in July 2017, CK Mishra, who was then health secretary, told attendees at a think tank conference that India’s health statistics were inadequate. The latest edition of the National Family Health Survey (NFHS), he said, was unreliable for several of the nation’s states. Underlining his statements, experts have since pointed out that the NFHS suffers from problems of poor data quality, insufficient population coverage and infrequent updates. As with the census, many of those who worked on the survey were ill-prepared for a job that was poorly thought out in the first place. The NFHS questionnaire on women’s health, for instance, comprised 1,139 questions across 93 pages. Some might describe that as a health hazard in its own right.

What then of India’s economic data? Despite some recent quibbles about the way in which GDP calculations have been revised under the Modi administration, the nation has enjoyed a reputation for collecting broadly reliable statistics on the state of its economy. However, maybe it is time now to reappraise even that impression.

The Mint report stated that MCA-21, a key Indian database of companies compiled by the Ministry of Corporate Affairs, was filled with closed, untraceable or miscategorised firms. A study of the database by a second government agency, the National Sample Survey Office, found that a whopping 39% of MCA-21 companies fell into one of these three categories.

The problem is that MCA-21 is one of the most important datasets used by the government to estimate the size and growth of Indian economic activity. What is more, flaws in the database mean that official Indian estimates of the shares of various sectors in the economy may be inaccurate too.

The impact of these missing companies on Indian GDP calculations may be minimal, or it may not. But that is not the point. This discovery is yet another hammer blow to the crumbling facade of India’s data-gathering institutions – state-funded bodies that public servants and ordinary citizens should be able to trust implicitly.

In today’s increasingly networked, digital world, the value of data has rocketed. In fact, many now refer to it as the new oil. In India’s case, though, it is starting to look like something rather more old-fashioned: snake oil.

Sidin Vadukut is an Indian author and historian who lives in London

From the economy to crime, healthcare and its census, the nation's statistics are proving increasingly unreliable