Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. (Wikipedia)

So I had a Crowdmap full of data sitting there. It had to be cleaned so that researchers could analyse it further. This post describes how I cleared out the irrelevant and corrupt data to produce an accurate database.

After downloading the large CSV file of data, it was obvious from the start that there was an issue with trusted reports. There were 2084 reports with no geolocation or categories, a total mixture of everything, and they had come in through every way of entering information on the platform. The reports marked as trusted could not be cleaned, because the information had not been reviewed, just approved and verified. As a result, many categories were missing and the majority of these reports had not been geolocated.

The main lesson for me is that there has to be a quality control team, as sadly so much of the data was not usable. Having to go through these reports one by one to check them was extremely time consuming. It is really worth remembering to set out criteria before people class reports as trusted.

Following the guidelines that Ushahidi had already published on the Wiki, I:

- removed all personal information from every report
- removed all data that could not be geolocated
- removed all duplicates

The simplest, least time-consuming way to do this was to use filters in Excel.
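For anyone who would rather script the three clean-up steps than filter by hand, here is a minimal sketch in Python using only the standard library. It is not what I used (I used Excel filters), and the column names (INCIDENT TITLE, DESCRIPTION, LATITUDE, LONGITUDE, EMAIL) and the sample rows are made up for illustration; a real Crowdmap export may label its columns differently.

```python
import csv
import io

# Hypothetical names for columns that hold personal information.
PERSONAL_COLUMNS = ("FIRST NAME", "LAST NAME", "EMAIL")


def clean_reports(rows):
    """Drop personal columns, rows without coordinates, and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Step 1: remove personal information.
        row = {k: v for k, v in row.items() if k not in PERSONAL_COLUMNS}

        # Step 2: remove reports that were never geolocated.
        lat = (row.get("LATITUDE") or "").strip()
        lon = (row.get("LONGITUDE") or "").strip()
        if not lat or not lon:
            continue

        # Step 3: remove duplicates (same title, description and location).
        key = (
            row["INCIDENT TITLE"].strip().lower(),
            row["DESCRIPTION"].strip().lower(),
            lat,
            lon,
        )
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up sample rows standing in for the downloaded CSV file.
sample = io.StringIO(
    "INCIDENT TITLE,DESCRIPTION,LATITUDE,LONGITUDE,EMAIL\n"
    "Long queue,Voters waiting since 6am,-1.2921,36.8219,a@example.org\n"
    "Long queue,Voters waiting since 6am,-1.2921,36.8219,b@example.org\n"
    "No materials,Ballot papers missing,,,c@example.org\n"
    "Calm voting,Everything fine,-0.0917,34.7680,\n"
)

cleaned = clean_reports(csv.DictReader(sample))
for report in cleaned:
    print(report)
```

A script like this only catches exact duplicates and empty coordinates, so a manual pass over the remaining rows is still needed.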
How many SMS came into the platform: 4372 (taken from information on the dashboard)
How many SMS were turned into reports: 17 (taken from clean data)
How many SMS were approved: 17 (taken from clean data)
Verified: 8 (taken from clean data)
Ultimately mapped: 17 (taken from clean data)

The SMS reports ended up as just 17. The reasons for this are:

- Not relevant to the deployment.
- Not enough information to be useful.
- Many had no geolocation possible.
- Some SMS were just put on the platform as "trusted source" with no information.
- Not stated as an SMS when the report was created. This made it impossible to tell which reports were SMS and which were not, so the figure of 17 does not show how many of the final total of over 1600 clean reports actually were SMS.

A point worth remembering: if a report comes from an SMS, make sure SMS is stated on the report when it is created.

Having 70 categories was also a challenge. This is the breakdown by category before the data was cleaned:

Category: Count
Geolocated: 2339
Trusted Reports: 1907
No Need To Translate: 1470
Everything Fine: 741
Translated: 684
Polling Station Logistical Issues: 418
Impossible to Geolocate: 323
Other: 252
Voting Irregularities: 201
Threat of violence: 178
Unresolved: 170
Unverified: 149
BVR Issues: 142
Voter Integrity Irregularities: 139
Civilian Peace Efforts: 127
Provisional Citizen Results: 110
Voter Register Irregularities: 106
Violent Attacks: 104
GEOLOCATION: 98
IEBC Officials not Acting In Accordance to Set Rules: 81
Voter Identification Issues: 60
Absence/Insufficient IEBC Officials At Polling Station: 55
Irregularities with voter assistance: 54
Mobilisation towards violence: 51
Missing/Inadequate Voting Materials: 49
Counting Irregularities: 47
Fear and Tension: 45
Rumours: 41
TRANSLATION: 38
Demonstrations: 33
Ballot Box Irregularities: 32
Absence/Insufficient law enforcement officials at Polling Station: 32
Eviction/population displacement: 29
Property Loss/Theft: 25
Police Peace Efforts: 25
Polling station logisitcal issues: 23
Polling Station Closed Before Voting Concluded: 22
Campaign material in polling station: 21
Ambush: 20
Protest over declared results: 20
Purchasing of Voters Cards: 20
Observers/Media Blocked From Entering Polling Station: 19
Hate Speech: 18
Resolved: 18
POSITIVE EVENTS: 17
Irregularities with transportation of ballot boxes: 15
Party Agent Irregularities: 14
Verified: 14
Failure to Announce Results By IEBC official: 13
Presence of weapons: 13
To Be Geolocated: 11
Riots: 11
Sexual and Gender Based Violence: 8
To Be Translated: 8
Armed Clashes: 6
Voters Issued Invalid Ballot Papers: 5
URGENT: 4
Abductions/kidnapping: 4
SMS-V: 4
Bombings: 4
Ballot Boxes Destroyed After Announcing Final Results: 3
Purchase of weapons: 3
certificate Issues: 2
Polling Station Administration: 1
Voting Issues: 1

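Some of the categories above look like the same thing entered twice, for example "Polling Station Logistical Issues" and the misspelt "Polling station logisitcal issues". As a rough sketch (again, not something I ran at the time), Python's standard difflib module can flag category names that are almost identical so a person can decide whether to merge them. The 0.95 threshold is my own assumption and would need tuning on a real category list; set it too low and distinct categories such as "Voting Irregularities" and "Counting Irregularities" start to be flagged as well.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.95):
    """Return pairs of names whose lowercase forms are almost identical."""
    pairs = []
    for a, b in combinations(names, 2):
        # ratio() is 1.0 for identical strings and lower as they diverge.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


# A handful of category names taken from the list above.
categories = [
    "Polling Station Logistical Issues",
    "Polling station logisitcal issues",
    "Voting Irregularities",
    "Counting Irregularities",
    "Geolocated",
    "GEOLOCATION",
]

suspects = near_duplicates(categories)
for a, b in suspects:
    print(f"Possible duplicate category: {a!r} / {b!r}")
```

This only catches spelling and capitalisation slips. Pairs such as "Geolocated" and "GEOLOCATION" score too low to be flagged and still need a human decision.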
After using the filter functions in Excel, I had to go manually through each line of the CSV file to make sure I had not missed anything.

Please, if anything can be learnt from this post-deployment data cleaning, it is that quality control HAS to be in place during deployment. This is key to gaining accurate information.

I gained so much personally from completing this task. I hope it helps you when you run a deployment and want to use the results after the event. If anyone has any questions, I am always available to answer them.

Happy Mapping Folks,
Jus