#1876: CSVParser.java

projectforge-common/src/main/java/org/projectforge/common/CSVParser.java
Utility class in the org.projectforge.common package. 342 lines · 248 code · 57 comments · 37 blank
Custom CSV (Comma-Separated Values) parser with a hand-written lexer/parser architecture. Reads from a java.io.Reader and tokenizes CSV data one character at a time, supporting quoted fields, embedded newlines within quoted cells, escaped double-quotes ("" convention), configurable field separators, header column name mapping, and UTF-8 BOM detection. Written by Kai Reinhard and H. Spiewok in 2005, it predates libraries such as Apache Commons CSV and has no external CSV library dependency.

Architecture

Parser Architecture — Hand-Written Lexer

Rather than using regex or a parser generator, CSVParser implements a character-by-character lexer with a pushback buffer. This design prioritizes control over error handling and performance for the specific subset of CSV formatting used by ProjectForge.

Core Components

Type Enum

enum Type { EOF, EOL, CHAR }

Token types: End-Of-File, End-Of-Line, or character data. This drives the parser state machine.
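
The lexer idea described above can be illustrated with a minimal sketch. It is not the ProjectForge source: the class name LexerSketch, the value field, and the describe helper are hypothetical, and treating a lone carriage return as a line end is an assumption. Only the Type enum and the use of a pushback buffer come from the description.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a character-level lexer; not the ProjectForge source. */
class LexerSketch {
  enum Type { EOF, EOL, CHAR }

  private final PushbackReader in;
  char value; // the character read when the last token type was CHAR

  LexerSketch(Reader reader) {
    this.in = new PushbackReader(reader, 1); // one character of pushback
  }

  /** Classifies the next character; CRLF, LF and a lone CR each count as one EOL. */
  Type nextToken() throws IOException {
    int c = in.read();
    if (c == -1) return Type.EOF;
    if (c == '\n') return Type.EOL;
    if (c == '\r') {
      int next = in.read();
      if (next != '\n' && next != -1) in.unread(next); // not part of the line end
      return Type.EOL;
    }
    value = (char) c;
    return Type.CHAR;
  }

  /** Renders the token stream of the input as text, for demonstration only. */
  static String describe(String input) {
    LexerSketch lexer = new LexerSketch(new StringReader(input));
    StringBuilder sb = new StringBuilder();
    try {
      Type t;
      do {
        t = lexer.nextToken();
        sb.append(t == Type.CHAR ? String.valueOf(lexer.value) : "<" + t + ">");
      } while (t != Type.EOF);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }
}
```

With this sketch, LexerSketch.describe("a;b\r\nc") yields a;b<EOL>c<EOF>. The pushback buffer is what lets the lexer look one character ahead after a carriage return without losing that character.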

Character Stream Management

UTF-8 BOM Handling

skipBOM() is called during construction to detect and skip a UTF-8 Byte Order Mark (\uFEFF) at the start of the file. If no BOM is present, the first character is pushed back (unread) so that it is parsed normally. This enables correct parsing of CSV files exported from Microsoft Excel, which writes a BOM at the start of files saved as UTF-8.
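
A minimal sketch of this read-then-unread step is shown below. It assumes the input arrives through a Reader that has already decoded the UTF-8 bytes EF BB BF into the single character U+FEFF. The class BomSketch and its methods skipBom and readAll are hypothetical names for illustration, not the actual skipBOM() implementation.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of BOM skipping; not the ProjectForge source. */
class BomSketch {
  static final int BOM = 0xFEFF; // how a decoded UTF-8 BOM appears to a Reader

  /** Wraps the reader and drops a leading BOM; any other first character is pushed back. */
  static PushbackReader skipBom(Reader reader) {
    PushbackReader in = new PushbackReader(reader, 1);
    try {
      int first = in.read();
      if (first != -1 && first != BOM) {
        in.unread(first); // not a BOM: make it available to the parser again
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return in;
  }

  /** Drains a reader into a string, for demonstration only. */
  static String readAll(Reader in) {
    StringBuilder sb = new StringBuilder();
    try {
      for (int c = in.read(); c != -1; c = in.read()) {
        sb.append((char) c);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }
}
```

Input with and without a leading BOM produces the same character stream afterwards, so the first header name is not polluted by an invisible character.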

Cell Parsing (parseCell)

The core CSV parsing logic handles these cases:

Case | Behavior
--- | ---
Unquoted cell | Characters are accumulated until separator or EOL
Quoted cell ("...") | Characters inside quotes accumulated; quotes must be properly closed
Escaped quote ("") | Two consecutive double-quotes inside a quoted cell represent one literal quote character
Embedded newline | Newlines within quoted cells are preserved (multiline cell values)
Trailing whitespace | Whitespace after closing quote is skipped; expects separator or EOL next
Unterminated quote | Throws RuntimeException with descriptive error message including line/column number

Line Parsing (parseLine)

Reads cells until EOL or EOF, collecting them into a List<String>. Returns null at EOF (not an empty list — callers can distinguish end-of-file from empty lines).
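
The cell and line rules above can be sketched in a compact form. This is an illustration under stated assumptions, not the ProjectForge code: the class MiniCsvSketch and its helpers are hypothetical, the error texts omit the line and column numbers the real parser adds, and returning an empty list for an empty line is an assumption drawn from the description of parseLine.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the cell and line rules; not the ProjectForge source. */
class MiniCsvSketch {
  private final PushbackReader in;
  private final char separator;

  MiniCsvSketch(Reader reader, char separator) {
    this.in = new PushbackReader(reader, 1);
    this.separator = separator;
  }

  /** Returns the cells of the next line, an empty list for an empty line, or null at EOF. */
  List<String> parseLine() {
    try {
      int c = in.read();
      if (c == -1) return null; // EOF: no more lines
      List<String> cells = new ArrayList<>();
      if (isEol(c)) { // empty line (assumed to yield an empty list)
        consumeEol(c);
        return cells;
      }
      in.unread(c);
      while (parseCell(cells)) {
        // parseCell returns true while a separator announces another cell
      }
      return cells;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Parses one cell into cells; returns true if a separator followed it. */
  private boolean parseCell(List<String> cells) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c = in.read();
    if (c == '"') {
      c = readQuoted(sb);
    } else {
      while (c != separator && !isEol(c) && c != -1) {
        if (c == '"') {
          throw new RuntimeException("Unexpected quotation mark \" (only allowed in quoted cells).");
        }
        sb.append((char) c);
        c = in.read();
      }
    }
    cells.add(sb.toString());
    if (isEol(c)) consumeEol(c);
    return c == separator;
  }

  /** Reads up to the closing quote and returns the terminator that follows the cell. */
  private int readQuoted(StringBuilder sb) throws IOException {
    while (true) {
      int c = in.read();
      if (c == -1) throw new RuntimeException("Quotation \" missed at the end of cell.");
      if (c == '"') {
        int next = in.read();
        if (next != '"') { // closing quote: skip blanks, then expect a terminator
          while (next == ' ' || next == '\t') next = in.read();
          if (next != separator && !isEol(next) && next != -1) {
            throw new RuntimeException("Delimiter or new line expected after quotation mark.");
          }
          return next;
        }
        // "" inside a quoted cell: fall through and append one literal quote
      }
      sb.append((char) c); // regular character, embedded newline, or escaped quote
    }
  }

  private static boolean isEol(int c) {
    return c == '\r' || c == '\n';
  }

  /** Treats CRLF as a single line end by consuming the LF after a CR. */
  private void consumeEol(int c) throws IOException {
    if (c == '\r') {
      int next = in.read();
      if (next != '\n' && next != -1) in.unread(next);
    }
  }
}
```

A caller would loop with `for (List<String> line = parser.parseLine(); line != null; line = parser.parseLine())`, relying on null as the only end-of-input signal so that empty lines in the middle of a file do not terminate the loop early.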

Header Column Support (parseHeadCols / getCell)

For CSV files with a header row, parseHeadCols() reads the first line and builds a colMap: Map<String, Integer> mapping column names to their positional index. Subsequent getCell(List<String>, colname) calls retrieve values by column name rather than position. This enables Excel-like named column access.
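
The name-to-index lookup can be sketched as follows. The class HeaderLookupSketch is hypothetical; unlike the real parseHeadCols(), which reads the header line from the input itself, this version receives the already parsed header cells as a parameter, and returning null for an unknown or missing column is an assumption rather than documented behavior.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of header-based cell lookup; not the ProjectForge source. */
class HeaderLookupSketch {
  private final Map<String, Integer> colMap = new HashMap<>();

  /** Records the position of every header name (here passed in, not read from a stream). */
  void parseHeadCols(List<String> headerCells) {
    for (int i = 0; i < headerCells.size(); i++) {
      colMap.put(headerCells.get(i), i);
    }
  }

  /** Returns the cell under the named column, or null if the column or cell is absent (assumed). */
  String getCell(List<String> line, String colname) {
    Integer index = colMap.get(colname);
    if (index == null || index >= line.size()) {
      return null;
    }
    return line.get(index);
  }
}
```

After parseHeadCols has seen the header cells name, email, and city, a call such as getCell(line, "email") returns the second cell of the line, so callers keep working if columns are reordered in the file.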

Error Messages

Four distinct error constants provide specific diagnostics:

ERROR_UNEXPECTED_QUOTATIONMARK = "Unexpected quotation mark \" (only allowed in quoted cells)."
ERROR_QUOTATIONMARK_MISSED_AT_END_OF_CELL = "Quotation \" missed at the end of cell."
ERROR_DELIMITER_OR_NEW_LINE_EXPECTED_AFTER_QUOTATION_MARK = "Delimiter or new line expected after quotation mark."
ERROR_UNEXPECTED_CHARACTER_AFTER_QUOTATION_MARK = "Unexpected character after quotation mark."

Each message is augmented with line and column numbers via createMessage().
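
How a position-augmented message might be assembled is sketched below. The exact wording produced by createMessage() is not given here, so the format string, the parameter list, and the class name ErrorMessageSketch are assumptions for illustration only; the constant text is copied from the list above.

```java
/** Hypothetical sketch of position-augmented error messages; not the ProjectForge source. */
class ErrorMessageSketch {
  static final String ERROR_UNEXPECTED_QUOTATIONMARK =
      "Unexpected quotation mark \" (only allowed in quoted cells).";

  /** Prefixes a diagnostic with the position at which the lexer stopped (format assumed). */
  static String createMessage(String message, int lineNumber, int columnNumber) {
    return "Error in line " + lineNumber + ", column " + columnNumber + ": " + message;
  }
}
```

A parser that counts lines and columns while reading can pass its current counters to such a helper before throwing, which lets users locate the offending cell in a large export.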

Integration with CSVWriter

CSVParser uses CSVWriter.DEFAULT_CSV_SEPARATOR_CHAR (';' — semicolon) as its default separator. This is the European CSV convention (Microsoft Excel in German locales uses semicolon-delimited CSV). The separator is configurable via setCsvSeparatorChar().

Design Limitations

This custom implementation was written in 2005, well before Apache Commons CSV (released 2014) or OpenCSV became widely available, and the JDK itself provides no built-in CSV support. The code has been maintained with incremental improvements: BOM handling (2024 commit), multiline quoted field support, and typo fixes via codespell.

Git History

868d6abb7 2025 -> 2026
161d71602 WIP: CSVParser: BOM chars.
dfb2378df WIP: CSVParser: multilines etc.
63081666f Source file headers: 2024-> 2025.
a73905c14 Fix typos in projectforge*/ directories Found via codespell
a72903e36 *.java, *.kt: StringBuffer -> StringBuilder.
b6092df09 Copyright 2023 -> 2024
ab45d51fa Copyright 2001-2022 -> 2001-2023.