I have pulled in data from separate data sources. I am of course de-silo-ing the data by making connections across them. However I found a problem. I ran a DISTINCT on the names and got 2 seemingly identical results. Well, okay, to the eye they are identical:
But their space characters are encoded differently. In the example image, the upper name's spaces all have the byte id of 32, and the lower name's spaces all are constructed of the two byte ids of 194 followed by 160.
I need to homogenize these two datasets so that these names are actually comparable. Is there an APOC or other approach, like using apoc.text.replace() to find one kind of space and convert it to the other to homogenize these?
Note: I try to copy and paste the spaces from the Neo4j desktop back into a replace function but I think the encoding gets homogenized when rendered to my screen.
One more thing: I have loaded all the data from the two datasets from csv files that came from large Excel files. Possibly I can change something in Excel or in my Visual Code editor to make them comparable.
This query will retrieve both.
It's hard to see the contents of the replace function, but the first character is a UTF8 space, and the next is an ASCII space.
MATCH (n:Person)
WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;
Hi Koji, I finally was able to find the one instance of the name from my source file that had a different encoding for 2 of its 3 spaces. That name, when compared with an equal sign ( = ) is considered different when compared to all other instances of the name. I learned this through extensive trial and error, copyng and pasting from the original source file into Neo4j. Once I found it, then I was able to build the replace query in the Neo4j browser as you described. I was hoping that there could be a way for me to use Neo4j alone to locate the problem character and fix it...like by using a conversion function. But I didn't find that for UTF8. I tried the only reasonable conversion function I could find: convert to string:
I think what has caused this is a source Microsoft Excel file where a value with a different encoding was pasted into that file, so that there were differnent encodings in the same document...if that's even possible? I exported the excel to a .csv file, and that anomaly was preserved. I would love to be corrected.
I noticed that c2a0 is not a 2-byte space in UTF8, but a Non-breaking Space.
Non-breaking space can be entered on a Mac with option + space.
The first space in the following is option + space
The second one is space.
MATCH (n:Person)
WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;