RBR engineering file import cleanup.#556
Conversation
sspagnol
commented
Jul 1, 2019
- Create convertXRengVar.m function to return IMOS name/data/comment (in the same vein as convertSBEcnvVar.m)
- Modify readXR420.m and readXR620.m to use convertXRengVar.m
- Add new pressure variable label 'pres26'
- newer instruments (eg RBRduet^3) have non-ascii character (subscripted 3 in this case), so convert model name with genvarname (result is RBRduet0xB3). Need to decide if this is good enough or do you want to further tidy it up like RBRduet3?
- Create convertXRengVar.m function to return IMOS name/data/comment (in the same vein as convertSBEcnvVar.m) - Modify readXR420.m and readXR620.m to use convertXRengVar.m - Add new pressure variable label 'pres26' - newer instruments (eg RBRduet^3) have non-ascii character (subscripted 3 in this case), so convert model name with genvarname (result is RBRduet0xB3).
|
Ok. I thought genvarname would work that same on 64-bit Windows and Linux Matlab, it doesn't. On Windows you would get RBRduet0xB3, and on Linux I'm getting RBRduet0xFFFD. So I don't know what to do in this case. |
header.model = genvarname(char(unicode2native(tkns{1}{1}, 'US-ASCII')));gives the same result of 'RBRduet0x1A' so then just search-replace '0x1A' with '3'? But then if RBR release a '4' version would have to add another search-replace. |
|
@sspagnol , good refactoring there.
This is bad - now I understand why there are so many comparisons with escaped sequenced strings. IMO, this is a noisy conversion to get a string right. I look into the genvarname help and it's supposedly marked for removal in future releases. did you tried the matlab.lang.makeUniqueStrings or matlab.lang.makeValidName? they return the proper names without escaping sequences in my limited trials. Nonetheless, let's stop depending on this function and matches with escape sequences...it's bad and as above, can cause architecture-dependent bugs. I need 3 things to merge this:
|
modelstring = matlab.lang.makeValidName(tkns{1}{1}, 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0xB3$', '3'); % subscripted 3
header.model = modelstring;The matlab.lang.makeValidName construct with hex replacement gave me identical results on R2019a WIN64 and R2018a GLNXA64. Whether you code the string replacements as just a series of regexprep or as an action if a certain substring is found is up to you. Assumes RBR doesn't go crazy and uses unicode characters above 0xFFFF.
filename = '082533_20190525_0515_eng_rbrduet3.txt'
sample_data = XRParse({filename}, 'timeSeries')
sample_data.meta.instrument_model
ans =
'RBRduet³'whereas I wanted
|
|
The matlab.lang.makeValidName does have a downside. A model string of 'TDR-2050' would be converted to 'TDR0x2D2050' (because you can't have '-' in a Matlab variable name). So one alternative is to just code for the exceptions like modelstring = matlab.lang.makeValidName(tkns{1}{1}, 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0xB3$', '3'); % subscripted 3
modelstring = regexprep(modelstring, '^TR0x2D', 'TR-'); % 'TR-'
modelstring = regexprep(modelstring, '^TDR0x2D', 'TDR-'); % 'TDR-'
header.model = modelstring;Another would be to only convert non-asci characters to their hex equivalents, then do clean up. |
|
Another annoying update : I am still getting inconsistent results between Windows + Linux + version versions of Matlab. Think that the So (on my installation) Windows it is opening as 'windows-1252' encoding, and 'UTF-8' under linux . So choices could be either use the modelstring = matlab.lang.makeValidName(char(unicode2native(tkns{1}{1}, 'US-ASCII')), 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0x1A$', '3'); % subscripted 3
modelstring = regexprep(modelstring, '^TR0x2D', 'TR-'); % 'TR-'
modelstring = regexprep(modelstring, '^TDR0x2D', 'TDR-'); % 'TDR-'
header.model = modelstring;or alternatively use Which do you want to do? |
|
Surly matlab.lang.makeValidName is not really what you want here, if you must what about, modelstring = regexprep(modelstring, '0x2D', '-'); % convert 0x2D to its character what about modelstring = char(unicode2native(tkns{1}{1}, 'US-ASCII')) have not checked if that works |
|
I think I found the problem here.Looks like matlab fopen do not detect the correct encoding in auto-mode: >> fid=fopen(file);
>> [fname,fperm,fformat,fenc] = fopen(fid);
>> fenc
fenc =
'UTF-8'but the file is not UTF-8, otherwise the string below would be correctly printed: >> firstline=textscan(fid,'%s\n');
>> fstr=firstline{1}{1}
fstr =
'Model=RBRduet�'
>>Apparently, we shouldn't rely on Matlab automatic encoding detection. I was surprised not to have an error when reading in UTF-8 mode. I had to loop through the encodings to see what was the correct one. it's some ISO-8859 variant or windows-12** variant. Model=RBRduet� with Big5
Model=RBRduet³ with ISO-8859-1
Model=RBRduetณ with windows-874
Model=RBRduet� with Big5-HKSCS
Model=RBRduetł with ISO-8859-2
Model=RBRduet� with windows-949
Model=RBRduet with CP949
Model=RBRduet³ with ISO-8859-3
Model=RBRduetł with windows-1250
Model=RBRduet with EUC-KR
Model=RBRduetŗ with ISO-8859-4
Model=RBRduetі with windows-1251
Model=RBRduet with EUC-JP
Model=RBRduetГ with ISO-8859-5
Model=RBRduet³ with windows-1252
Model=RBRduet with EUC-TW
Model=RBRduet� with ISO-8859-6
Model=RBRduet³ with windows-1253
Model=RBRduet� with GB18030
Model=RBRduet³ with ISO-8859-7
Model=RBRduet³ with windows-1254
Model=RBRduet with GB2312
Model=RBRduet³ with ISO-8859-8
Model=RBRduet³ with windows-1255
Model=RBRduet� with GBK
Model=RBRduet³ with ISO-8859-9
Model=RBRduet³ with windows-1256
Model=RBRduet│ with IBM866
Model=RBRduetณ with ISO-8859-11
Model=RBRduet³ with windows-1257
Model=RBRduetЁ with KOI8-R
Model=RBRduet³ with ISO-8859-13
Model=RBRduet³ with windows-1258
Model=RBRduetЁ with KOI8-U
Model=RBRduet³ with ISO-8859-15
Model=RBRduet� with US-ASCII
Model=RBRduet≥ with Macintosh
Model=RBRduet� with UTF-8
Model=RBRduetウ with Shift_JISHence, your problem of not showing the string correctly in plots is because the encoding used is wrong. Nonetheless, I don't see a reason to use genvarname (and variants) to transform the string to a valid variable name if this will never be used as a varname. The correct string is also written correctly to netCDF: >> nccreate('nctest.nc','test');
>> ncwriteatt('nctest.nc','test','instrument',fstr)
>> !ncdump -h nctest.nc
netcdf nctest {
variables:
double test ;
test:instrument = "Model=RBRduet³" ;
} |
|
And just an FYI the newer versions of Ruskin (have only tested v2.11.1, not sure when in the v2.x series this occured) no longer has legacy Rtext engineering export option. It now exports a zipped of set of text files consisting of (file == serial number/datetime identifier) file_metadata.txt: yaml file describing the data. file_data.txt: csv file of data, one header line, format described in metadata.txt file_events.txt: log of some event(s), not exactly sure what yet. Have asked for descriptions of new format. Have attached an example. So maybe need to look at #444 (utilizing RBR RSKtools for MATLAB within imos-toolbox) again. |
|
IMO, If RBR is using sqlite file format, than we may also read straight from it instead of relying on another text exported format (json+csv apparently). Unless, of course, these exported files are the only way to touch any processed datastream created by their software. It will also be likely faster than anything else, with rich metadata, and probably without (hopefully) many breaking changes, since any breaks will follow up on their software stack as well. Let's discuss details over #444, maybe we need to arrange some discussion with rbr users and see what is most portable approach here. |
