ReadList problem related to Record type
Your problem is that you've got a slight misunderstanding of the different types of items that ReadList
can read. That's OK, it can be a little confusing.
To begin with: String, Number, Expression, etc. are not sub-types of Record. They are all separate types with their own rules for how they are read. The RecordSeparators option is only applied to Records and Words.
If you have some complicated input format whose parsing is best controlled by RecordSeparators/WordSeparators, you should probably just use the Record/Word types, which will give you strings; afterwards, convert the ones you know to be numeric by using ToExpression.
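A rough sketch of that approach, as my own illustration: read each line as a list of Words split on semicolons, then convert everything after the first field with ToExpression. The sample string is made up to resemble this kind of input, the names stream and rows are placeholders, and I'm assuming every field after the first is a valid Mathematica expression.

```mathematica
str = "Tue 1 Jan 2013 23 : 00 : 01; 17; {}; 32.5; 0.\nTue 2 Jan 2013 2 : 20 : 01; 47; {3,4}; 3.5; 110.";

stream = StringToStream[str];
(* One list of Words per line: the default RecordSeparators split on newlines,
   and ";" is the only WordSeparator, so spaces stay inside the Words *)
rows = ReadList[stream, Word, RecordLists -> True, WordSeparators -> {";"}];
Close[stream];

(* Keep the first field as a string, convert the remaining fields *)
Join[{First[#]}, ToExpression /@ Rest[#]] & /@ rows
```

ToExpression evaluates whatever it is given, so this is only reasonable for input you trust.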
On the other hand, if you want to gain power over this area of Mathematica, read on.
BASIC CONCEPTS OF READ
Let's make up some terms to help explain things. At their base, the Mathematica functions Read, ReadList, and Skip read in ITEMS from an input stream. An Item is a number like 3.14159e-26, or a string like "peg and awl". There are different TYPES of Items: Record, Word, String, Number, Real, Character, and Byte. These Types correspond to either a string or a number, with different rules for how the input is parsed. There is also an Expression Type which corresponds to an Item which is a general Mathematica expression of any form.
WHAT ARE THE ITEMS AND HOW DO THEY WORK?
The simplest cases:
- a Character is a single character from the stream, represented as a one-letter string.
- a Byte is a single character from the stream, represented as an integer that is the Character Code of the character.
Both Character and Byte have the same rule -- read one character -- but different representations, string vs. number.
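A minimal illustration of the two representations, reading the same one-character stream twice (the variable name s is just a placeholder):

```mathematica
s = StringToStream["A"];
Read[s, Character]   (* "A" *)
Close[s];

s = StringToStream["A"];
Read[s, Byte]        (* 65 *)
Close[s];
```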
To describe the other Item Types, you need to understand that they are all assembled from sequences of characters read from the stream, with some particular character that marks their end. That character is not part of the Item being read! It is a TERMINATOR, or TERMINATING CHARACTER. We say that the Item was TERMINATED by a particular character in the stream.
- a Record is a string, a sequence of characters terminated by a RecordSeparator.
- a Word is a string, a sequence of characters terminated by a WordSeparator, RecordSeparator, or TokenWord.
- a String is a sequence of characters terminated by a newline (\n character). Basically it's a LINE of text input, starting at the current stream position.
- a Number is any sequence of characters that can be interpreted as a number (in Fortran syntax), terminated by any character that can't be part of the number. Any whitespace (spaces, newlines, tabs) preceding the number is quietly skipped over first.
- a Real is the same as a Number but it's always a floating-point value, never an integer.
- an Expression is a sequence of one or more newline-terminated lines that form a parseable Mathematica expression. It's terminated by whatever newline ends the last line of the expression. If you've ever typed in a multi-line input to the raw text kernel, you know how this works.
Records, Words, Strings, and Characters become Mathematica strings.
Numbers, Reals, and Bytes become Mathematica numbers.
Expressions become Mathematica expressions.
Numbers, Reals, Strings, and Expressions pay no attention to RecordSeparators and WordSeparators. They have their own rules for when they stop taking characters from the stream.
(The end of the stream, represented in Mathematica by the symbol EndOfFile, is nearly always a terminator. It's not a character, though.)
OBJECTS: GROUPS OF ITEMS
I have just told you the only Types of Items that can be read. However, there's another term that has to be introduced. The second argument of Read, ReadList, and Skip -- the input specification -- can be a complex expression which contains one or more of these Types. Let's call that an OBJECT. For instance,
Read[stream, {String, Number, Plus[Number, Real], Hold[Expression]}]
reads an Object: a sequence of five Items. Several of the Items are placed inside larger expressions, and the whole thing is placed inside a List head.
The degenerate case of an Object is a single naked Item: ReadList[stream, Byte]
If you don't specify a second argument to Read, ReadList, or Skip, it defaults to Expression. ReadList[stream] == ReadList[stream, Expression]
Read, ReadList, and Skip proceed left to right through the Object; each Item Type they encounter causes an Item to be read from the stream. As I listed above, each Type has its own rules for how many characters it will snatch up, what it will do with them, and when it will stop.
If you are constructing complex Objects consisting of several Item Types, you especially need to know when each one will stop. This requires understanding the TERMINATORS for each type. Just as importantly, you need to know what is done with those terminating characters.
WHAT HAPPENS TO THE CHARACTERS THAT TERMINATE AN ITEM
Terminating characters are not part of the Item that is read. They simply mark that Item's end in the stream. Different Types apply different rules to how they treat the terminator -- that is, where they leave the position of the stream pointer after they are done.
Bytes and Characters don't have terminators, of course.
Expressions have terminators that are newlines. The stream pointer is left sitting at the newline. An Expression like
1+2*
3/4-
5
has three newlines in it, one at the end of each line. The newline after 5 is the terminator for this Expression, and after Read[stream, Expression] that character's position is the stream's position: StreamPosition[stream] == 11. If you followed Read[stream, Expression] with Read[stream, Character] you'd get a \n.
Strings also have newline terminators. But they CONSUME the newline, skipping over it, leaving the stream pointer after it. The newline character is not part of the String, but if you read a Character after reading a String you wouldn't get a \n, you'd get whatever is at the beginning of the next line.
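To see the difference between the two newline rules, something like the following sketch should do. The input strings are made up for illustration, and the commented results are what the rules described above predict.

```mathematica
s = StringToStream["1+2\nabc"];
Read[s, Expression]   (* 3 *)
Read[s, Character]    (* per the description above: "\n", the unconsumed terminator *)
Close[s];

s = StringToStream["first\nsecond"];
Read[s, String]       (* "first" *)
Read[s, Character]    (* "s": the newline has already been consumed *)
Close[s];
```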
Numbers, like Expressions, do not consume their terminating characters. They leave the stream pointer at that character, whatever it is. For instance, if you read a Number and then a Character from "64+32*3", the Number would be 64, and the Character would be "+". I think you can see why this is what you want.
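That example can be tried directly on a string stream (s is a placeholder name):

```mathematica
s = StringToStream["64+32*3"];
Read[s, {Number, Character}]   (* per the text: {64, "+"} *)
Close[s];
```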
Records and Words leave the stream pointer pointing at whatever character terminated them. This character is a RecordSeparator or WordSeparator; only Records or Words care about those options. However, if you then read another Record or Word, the stream will first SKIP OVER the RecordSeparator or WordSeparator that it is pointing to, the terminator for the previous Record or Word. Then it will proceed to read the next Record or Word. (Exception: this skipping does not happen if you're about to Read another Word and the separator was a TokenWord.)
If the input stream were
an,a,tev,ka
0123456789
and RecordSeparators->",", then reading one Record would give you "an", and the stream position would be 2. If it reads another Record then it will skip the comma, move to position 3, and then read "a". The stream position would be left at 4, the second comma.
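Here is a sketch of that walk-through on a string stream. I'm writing the option as a list, RecordSeparators -> {","}, which is the form I'm sure of, and the commented values are the ones given in the text rather than freshly checked output.

```mathematica
s = StringToStream["an,a,tev,ka"];

Read[s, Record, RecordSeparators -> {","}]   (* per the text: "an" *)
StreamPosition[s]                            (* per the text: 2, the first comma *)

Read[s, Record, RecordSeparators -> {","}]   (* per the text: "a" *)
StreamPosition[s]                            (* per the text: 4, the second comma *)

(* The terminator is still there if you want to inspect it *)
Read[s, Character]                           (* "," *)
Close[s];
```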
In general this is what you want. You want to be able to read multiple Records without having the terminating characters interfere, but you also want to be able to grab those characters if desired. You might have RecordSeparators->{"+", "-", "*", "/"}, and you need to inspect the Character after the Record to find out which particular separator stopped it.
I hope this is an adequate explanation. I am not going to talk about Record and Word behavior when you have left-and-right matched delimiters as RecordSeparators or WordSeparators (like parentheses); nor NullRecords and NullWords; nor RecordLists; and there's one very useful special case where Numbers can consume RecordSeparator terminators. Please let me know if there's anything unclear and I'll hack on this response to make it unclearer.
I suspect you want something like this:
str = "Tue 1 Jan 2013 23 : 00 : 01; 17; {}; 32.5; 0.\nTue 2 Jan 2013 2 : 20 : 01; 47; \
{3,4}; 3.5; 110."
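(* blank[] expands to blank[Character]; ReadList reads one character there, and blank[_String] then drops it from the result *)
blank[] = blank[Character];
blank[_String] = Sequence[];
(* expr[] expands to expr[Word]; the Word that is read is then parsed as an expression *)
expr[] = expr[Word];
expr[x_String] := ToExpression[x]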
ReadList[StringToStream[str],
{Word, blank[], Number, blank[], expr[], blank[], Number, blank[], Number},
WordSeparators -> {"\n", ";"}
]
{{"Tue 1 Jan 2013 23 : 00 : 01", 17, {}, 32.5`, 0.`}, {"Tue 2 Jan 2013 2 : 20 : 01", 47, {3, 4}, 3.5`, 110.`}}
I'm not sure the problem here lies with the mixture of types for the objects to read; it seems to be related to the record separators.
str = "Tue 1 Jan 2013 23 : 00 : 01; 17; {}; 32.5; 0.\nTue 2 Jan 2013 \
2 : 20 : 01; 47; {3,4}; 3.5; 110.";
If the semicolon separators are replaced with linefeeds, everything works as expected.
ReadList[StringToStream[StringReplace[str, {";" -> "\n"}]],
{String, Number, Expression, Number, Number}]
{{"Tue 1 Jan 2013 23 : 00 : 01", 17, {}, 32.5, 0.}, {"Tue 2 Jan 2013 2 : 20 : 01", 47, {3, 4}, 3.5, 110.}}
So as you mention one solution would be to replace the separators in the input file.
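If you are on Linux or OS X, something along the lines of sed -i 's/;/\n/g' myfile.dat will work with GNU sed. The BSD sed that ships with OS X expects a backup suffix after -i and does not turn \n in the replacement into a newline, so the command needs adjusting there. Taking a backup first is a good idea, as this is an in-place replacement.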