#1876: CSVParser.java

Imports

java.io.IOException / java.io.Reader — Character stream input
java.util.ArrayList, HashMap, List, Map — Result storage and header column indexing
org.apache.commons.lang3.StringUtils — String blank-check in error messages
org.slf4j.Logger — Error logging

Parser Architecture — Hand-Written Lexer

Rather than using regex or a parser generator, CSVParser implements a character-by-character lexer with a pushback buffer. This design prioritizes control over error handling and performance for the specific subset of CSV formatting used by ProjectForge.

Core Components

Type Enum

enum Type { EOF, EOL, CHAR }

Token types: End-Of-File, End-Of-Line, or character data. This drives the parser state machine.

Character Stream Management

Pushback buffer: A 5-element int[] (pushbackBuffer) with index tracking — enables lookahead and backtracking without Reader support for mark()/reset()
read(): Returns next character from pushback buffer (if any) or underlying Reader. Tracks line numbers on \n
unread(int): Pushes a character back onto the buffer, adjusting line/column counters
nextToken(): Core tokenizer — returns the next Type (EOF, EOL, CHAR) and sets cval for character tokens. Handles \r\n (Windows CRLF) as a single EOL token
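
The interplay of read(), unread(int), and nextToken() can be sketched roughly as follows. This is a minimal illustration, not the ProjectForge source: the class name PushbackLexerSketch, the field names pushbackIndex and lineno, the overflow check, and the handling of a lone \r as an EOL are all assumptions, and column tracking is omitted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch of a lexer with a fixed-size pushback buffer.
class PushbackLexerSketch {
    enum Type { EOF, EOL, CHAR }

    private final Reader source;
    private final int[] pushbackBuffer = new int[5]; // fixed capacity, as described above
    private int pushbackIndex = 0;                   // number of characters currently pushed back
    private int lineno = 1;                          // assumed 1-based line counter
    char cval;                                       // valid only after nextToken() returned CHAR

    PushbackLexerSketch(Reader source) {
        this.source = source;
    }

    // Serve pushed-back characters first (last in, first out), then the Reader.
    int read() {
        try {
            int c = pushbackIndex > 0 ? pushbackBuffer[--pushbackIndex] : source.read();
            if (c == '\n') {
                lineno++;
            }
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Push a character back and undo the line count if it was a newline.
    void unread(int c) {
        if (pushbackIndex >= pushbackBuffer.length) {
            throw new IllegalStateException("Pushback buffer overflow");
        }
        if (c == '\n') {
            lineno--;
        }
        pushbackBuffer[pushbackIndex++] = c;
    }

    // Classify the next character; "\r\n" collapses into a single EOL.
    Type nextToken() {
        int c = read();
        if (c == -1) {
            return Type.EOF;
        }
        if (c == '\r') {
            int next = read();
            if (next != '\n') {
                unread(next); // assumption: a lone CR also counts as EOL
            }
            return Type.EOL;
        }
        if (c == '\n') {
            return Type.EOL;
        }
        cval = (char) c;
        return Type.CHAR;
    }

    int getLineNumber() {
        return lineno;
    }
}
```

The one-character lookahead after \r is the reason a pushback mechanism is needed at all: the lexer must read one character too far to decide whether it is looking at a CRLF pair.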

UTF-8 BOM Handling

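skipBOM() is called during construction to detect and skip a UTF-8 Byte Order Mark at the start of the file. Because the parser reads decoded characters from a Reader, a UTF-8 BOM appears as the single character \uFEFF rather than as the raw bytes EF BB BF. If no BOM is present, the first character is pushed back (unread). This enables correct parsing of CSV files exported from Microsoft Excel, which writes a BOM at the start of its UTF-8 CSV exports.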
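
The read-then-unread pattern behind BOM skipping might look like the sketch below. It substitutes the JDK's java.io.PushbackReader for the parser's own pushback buffer, so it shows the technique only; the class name BomSkipSketch and the helper readAll are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch: consume a leading U+FEFF, otherwise leave the stream untouched.
class BomSkipSketch {
    static final int BOM = '\uFEFF';

    static PushbackReader skipBom(Reader in) {
        PushbackReader reader = new PushbackReader(in, 1);
        try {
            int first = reader.read();
            // Push back anything that is neither a BOM nor end-of-stream.
            if (first != BOM && first != -1) {
                reader.unread(first);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader;
    }

    // Helper for demonstration: drain a Reader into a String.
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Without this step, the BOM would end up as an invisible first character of the first cell, which is particularly harmful when that cell is a header name used for lookups.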

Cell Parsing (parseCell)

The core CSV parsing logic handles these cases:

Case	Behavior
Unquoted cell	Characters are accumulated until separator or EOL
Quoted cell (`"..."`)	Characters inside quotes accumulated; quotes must be properly closed
Escaped quote (`""`)	Two consecutive double-quotes inside a quoted cell represent one literal quote character
Embedded newline	Newlines within quoted cells are preserved (multiline cell values)
Trailing whitespace	Whitespace after closing quote is skipped; expects separator or EOL next
Unterminated quote	Throws RuntimeException with descriptive error message including line/column number
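
The cases in the table can be sketched as a single method. This is an illustration under simplifying assumptions, not the actual parseCell: it scans a String by offset instead of consuming tokens, reports an offset rather than line/column, and the class name CellParserSketch is hypothetical. The error texts are the ones listed under Error Messages.

```java
// Hypothetical sketch of quoted and unquoted cell parsing.
class CellParserSketch {
    private final String input;
    private final char separator;
    private int pos = 0;

    CellParserSketch(String input, char separator) {
        this.input = input;
        this.separator = separator;
    }

    // Parses one cell and consumes a following separator; an EOL is left for the caller.
    String parseCell() {
        StringBuilder cell = new StringBuilder();
        if (pos < input.length() && input.charAt(pos) == '"') {
            pos++; // skip the opening quote
            while (true) {
                if (pos >= input.length()) {
                    throw new RuntimeException(
                        "Quotation \" missed at the end of cell. (offset " + pos + ")");
                }
                char c = input.charAt(pos++);
                if (c != '"') {
                    cell.append(c); // includes embedded newlines
                } else if (pos < input.length() && input.charAt(pos) == '"') {
                    cell.append('"'); // "" inside quotes is one literal quote
                    pos++;
                } else {
                    break; // closing quote
                }
            }
            // Skip blanks after the closing quote, then require separator or EOL.
            while (pos < input.length() && (input.charAt(pos) == ' ' || input.charAt(pos) == '\t')) {
                pos++;
            }
            if (pos < input.length() && !isSeparatorOrEol(input.charAt(pos))) {
                throw new RuntimeException(
                    "Delimiter or new line expected after quotation mark. (offset " + pos + ")");
            }
        } else {
            while (pos < input.length() && !isSeparatorOrEol(input.charAt(pos))) {
                char c = input.charAt(pos++);
                if (c == '"') {
                    throw new RuntimeException(
                        "Unexpected quotation mark \" (only allowed in quoted cells). (offset " + pos + ")");
                }
                cell.append(c);
            }
        }
        if (pos < input.length() && input.charAt(pos) == separator) {
            pos++; // consume the separator so the next call starts at the next cell
        }
        return cell.toString();
    }

    private boolean isSeparatorOrEol(char c) {
        return c == separator || c == '\n' || c == '\r';
    }
}
```

For example, with ';' as separator, the input "a ""b""" ;x yields the cell a "b" followed by the cell x.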

Line Parsing (parseLine)

Reads cells until EOL or EOF, collecting them into a List<String>. Returns null at EOF (not an empty list — callers can distinguish end-of-file from empty lines).
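
The null-at-EOF contract suggests a caller loop like the one below. The stand-in parseLine here is deliberately simplified (it splits unquoted lines only) and exists just to make the loop runnable; the class name ParseLineContractSketch and the choice to return an empty list for an empty line are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the "null means EOF, empty list means empty line" contract.
class ParseLineContractSketch {

    // Simplified stand-in for parseLine(): no quoting support.
    static List<String> parseLine(BufferedReader in, char separator) {
        String line;
        try {
            line = in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (line == null) {
            return null; // end of file
        }
        if (line.isEmpty()) {
            return new ArrayList<>(); // empty line: a list, but with no cells
        }
        String regex = Pattern.quote(String.valueOf(separator));
        return new ArrayList<>(Arrays.asList(line.split(regex, -1)));
    }

    // Typical caller loop: stop on null, keep empty lines.
    static List<List<String>> parseAll(String csv, char separator) {
        BufferedReader in = new BufferedReader(new StringReader(csv));
        List<List<String>> rows = new ArrayList<>();
        List<String> cells;
        while ((cells = parseLine(in, separator)) != null) {
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(parseAll("a;b\n\nc", ';'));
    }
}
```

A caller that tested for an empty list instead of null would stop at the first blank line and silently drop the rest of the file.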

Header Column Support (parseHeadCols / getCell)

For CSV files with a header row, parseHeadCols() reads the first line and builds a colMap: Map<String, Integer> mapping column names to their positional index. Subsequent getCell(List<String>, colname) calls retrieve values by column name rather than position. This enables Excel-like named column access.
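
A minimal sketch of the name-to-index lookup follows. Unlike the parseHeadCols() described above, this version receives the already-parsed header cells as an argument, and returning null for an unknown column or a short row is an assumption; the class name HeaderColsSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of header-based cell access.
class HeaderColsSketch {
    private final Map<String, Integer> colMap = new HashMap<>();

    // Record the position of every header name.
    void parseHeadCols(List<String> headerCells) {
        for (int i = 0; i < headerCells.size(); i++) {
            colMap.put(headerCells.get(i), i);
        }
    }

    // Look up a value by column name; null if the column is unknown
    // or the row has fewer cells than the header (assumed behavior).
    String getCell(List<String> line, String colname) {
        Integer index = colMap.get(colname);
        if (index == null || index >= line.size()) {
            return null;
        }
        return line.get(index);
    }
}
```

Name-based access keeps callers working when columns are reordered in the export, as long as the header names stay the same.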

Error Messages

Four distinct error constants provide specific diagnostics:

ERROR_UNEXPECTED_QUOTATIONMARK = "Unexpected quotation mark \" (only allowed in quoted cells)."
ERROR_QUOTATIONMARK_MISSED_AT_END_OF_CELL = "Quotation \" missed at the end of cell."
ERROR_DELIMITER_OR_NEW_LINE_EXPECTED_AFTER_QUOTATION_MARK = "Delimiter or new line expected after quotation mark."
ERROR_UNEXPECTED_CHARACTER_AFTER_QUOTATION_MARK = "Unexpected character after quotation mark."

Each message is augmented with line and column numbers via createMessage().
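
How a position might be appended to a message can be sketched as below. The signature of createMessage and the exact wording of the suffix are assumptions; only the constant's text is taken from the list above.

```java
// Hypothetical sketch: append the position to a fixed error text.
class ErrorMessageSketch {
    static final String ERROR_UNEXPECTED_QUOTATIONMARK =
        "Unexpected quotation mark \" (only allowed in quoted cells).";

    // The suffix format is illustrative, not the real one.
    static String createMessage(String message, int line, int column) {
        return message + " Error in line " + line + ", column " + column + ".";
    }
}
```

Keeping the message texts as constants lets tests and callers match on a specific error without parsing the position suffix.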

Integration with CSVWriter

CSVParser uses CSVWriter.DEFAULT_CSV_SEPARATOR_CHAR (';' — semicolon) as its default separator. This is the European CSV convention (Microsoft Excel in German locales uses semicolon-delimited CSV). The separator is configurable via setCsvSeparatorChar().

Design Limitations

No streaming API: Each call to parseLine() reads one full line and returns all of its cells as a List<String>. This suits moderate-sized files but is not intended for very large (multi-GB) CSVs
Fixed pushback buffer: The 5-character pushback limits lookahead; sufficient for CSV escaping patterns but not for general parsing
No type conversion: All values are returned as strings; callers must parse numbers, dates, etc.
Semicolon default: Uses European CSV convention; must be explicitly changed for comma-separated files
Error handling: Uses RuntimeExceptions rather than checked exceptions or error recovery — malformed input halts parsing

Git History

This custom implementation was written in 2005, well before Apache Commons CSV (released 2014) or OpenCSV became widely available. The JDK had, and still has, no built-in CSV support. The code has been maintained with incremental improvements: BOM handling (2024 commit), multiline quoted field support, and typo fixes via codespell.
