Teradata Notes
These are my personal notes that I use as a quick reference in my work.
See if the database is running:
pdestate -a
Start the database:
/etc/init.d/tpa start
select * from dbc.databases;
– I notice that we have schemas, accounts, and owners
select * from dbc.tables where tablename like '...';
select * from dbc.users;
HELP SESSION
: to get some information on the current session
https://www.teradata.com/University/Overview
fonts, etc: menu > window > preferences > general > appearance
There is a reset button too
The DBC user/schema has many useful views.
Some "schemas" – things listed in the explorer – are users, and some are databases.
To see only the databases, and not the users: menu > window > preferences > teradata datatools > data source explorer load.
The icon distinguishes between users and databases, or you can right-click and see the DDL
The pattern I see is:
Result set viewer: wrench on the result set bar, or menu > window > preferences > teradata datatools > result set viewer
alternate rows in color: enable
max display row count: 500
On history, useful columns are
note: "notes" allows adding a note to the history
Enable suggestions: menu > SQL > code assist autoactivation
projects:
stored in the workspace
See where the projects are stored: right-click on the project in the explorer
Ctrl+/ to add or remove commenting
DATABASE db-nm
– sets the current database to db-nm
SELECT DATABASE
– shows current database
ctrl+space: templates
Add templates to preferences > data management > sql development > sql editor > templates
Query pause button: red button in upper right
CREATE DATABASE dbname
AS
PERM = 0; -- minimum
CREATE DATABASE dbname
FROM parentdb
AS -- generally parentdb is DBC
PERM = 6400000000
SPOOL = 39662995374
TEMPORARY = 39662995374 -- not mandatory
FALLBACK -- not mandatory
ACCOUNT= 'accountname' -- not mandatory
NO AFTER JOURNAL -- not mandatory
NO BEFORE JOURNAL; -- not mandatory
(It looks like FALLBACK if there is PERM>0, otherwise NO FALLBACK)
Permanent space is space available for permanent objects
Spool space is the part of permanent space available for intermediate results
Temp space is the unused part of the permanent space, used for Global Temporary tables
TMODE=TERA
TERA mode is case-INsensitive when comparing strings
TMODE=ANSI
ANSI mode is case-sensitive when comparing strings
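A quick sketch of the difference (hypothetical table and column names):
SELECT * FROM hr.emp WHERE last_name = 'smith';
-- TMODE=TERA: also matches 'Smith' and 'SMITH' (case-insensitive comparison, unless the column is CASESPECIFIC)
-- TMODE=ANSI: matches only 'smith' (case-specific comparison, unless the column is NOT CASESPECIFIC)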
SHOW TABLE tbl-nm;
Show the DDL statement (error if a view)
SHOW VIEW vw-nm;
Show the DDL statement
HELP TABLE vw-nm;
Return list of columns as result set, for table or view. This includes comments.
HELP STATS tbl-nm;
Return statistics for table or view
Copy a table:
CREATE TABLE db.tablename -- use "..." to surround names if necessary
AS db.sourcetablename
WITH DATA;
CREATE TABLE db.empty_table -- creates an empty table
AS db.sourcetablename
WITH NO DATA; -- notice "NO"
Create based on a query
CREATE TABLE tablename
AS (SELECT ...)
WITH [NO] DATA;
Minimal table creation:
CREATE TABLE db.tb_nm -- best practice: put database name
( a VARCHAR(10) TITLE 'the name' NOT NULL
, n INTEGER TITLE 'num' FORMAT 'zzz9'
, ind CHAR(1) TITLE 'Important'
)
UNIQUE PRIMARY INDEX( a );
Additional configurations:
CREATE [SET|MULTISET] TABLE db.tb_nm -- multiset accepts duplicates
, [NO] FALLBACK -- a copy is in another AMP
, NO BEFORE JOURNAL
, NO AFTER JOURNAL
, CHECKSUM = [ALL|NONE|DEFAULT]
, DATABLOCKSIZE = ... -- small for OLTP, large for OLAP, usually specified by DBA
, MERGEBLOCKRATIO = 50 PERCENT -- less used
, FREESPACE = 50 PERCENT -- less used
, MAP = TD_MAP1 -- less used, see maps with select MapName from dbc.maps;
( ....
)
UNIQUE PRIMARY INDEX( a );
With journal:
CREATE [SET|MULTISET] TABLE db.tb_nm -- multiset accepts duplicates
, BEFORE JOURNAL
, AFTER JOURNAL
, WITH JOURNAL TABLE = tb_nm_2
SHOW TABLE tb_nm;
– show the DDL
DROP TABLE tb_nm;
SELECT databasename, tablename, columnname FROM dbc.columns WHERE UPPER(columnname) LIKE UPPER('%event%');
SELECT databasename, tablename FROM dbc.tables WHERE UPPER(tablename) LIKE UPPER('%event%');
SELECT databasename FROM dbc.databases;
SELECT * FROM dbc.databases;
Derived table: basically a named subquery in the FROM clause. Lives only for the duration of the query
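A minimal sketch of a derived table (hypothetical names):
SELECT d.dept, d.avg_sal
FROM (SELECT dept, AVG(salary) AS avg_sal FROM hr.emp GROUP BY dept) AS d -- the derived table
WHERE d.avg_sal > 50000;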
Volatile table:
lives in a user session (not in data dictionary)
CREATE [SET|MULTISET] VOLATILE TABLE tb_nm AS (SELECT ....) WITH DATA;
By default, the rows are dropped on commit. To keep:
CREATE [SET|MULTISET] VOLATILE TABLE tb_nm AS (SELECT ....) [WITH DATA] ON COMMIT PRESERVE ROWS;
In spool space
Global temporary table:
lives during a user session, and visible to other sessions (in data dictionary, other users can see)
CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tb_nm AS (SELECT ....) WITH DATA;
By default, the rows are dropped on commit. To keep:
CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tb_nm AS (SELECT ....) [WITH DATA] ON COMMIT PRESERVE ROWS;
SET tables do NOT contain duplicate records
MULTISET tables contain duplicate records
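A sketch of the difference (hypothetical names):
CREATE SET TABLE db.t_set (a INTEGER, b INTEGER) PRIMARY INDEX (a);
INSERT INTO db.t_set VALUES (1, 1);
INSERT INTO db.t_set VALUES (1, 1); -- rejected: duplicate row in a SET table (an INSERT...SELECT in Teradata mode silently discards the duplicate instead)
CREATE MULTISET TABLE db.t_multi (a INTEGER, b INTEGER) PRIMARY INDEX (a);
INSERT INTO db.t_multi VALUES (1, 1);
INSERT INTO db.t_multi VALUES (1, 1); -- accepted: duplicate rows are allowed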
ALTER TABLE tb_nm ADD col_nm data_type;
ALTER TABLE tb_nm RENAME col_nm TO col_nm2;
ALTER TABLE tb_nm DROP col_nm;
BYTEINT
(1 byte, -128 to 127)
SMALLINT
(2 bytes, -32768 to 32767)
INT, INTEGER
(4 bytes, about -2*10**9 to 2*10**9)
BIGINT
(8 bytes long)
DECIMAL(m,n)
(1 to 16 bytes), m = total number of digits (precision), n = digits after the decimal point (scale)
Rounds the fractional part to the scale, but fails when there are too many digits before the decimal point
NUMBER(m,n)
(1 to 16 bytes), m = total number of digits, n = digits after the decimal point. Same remark about the number of digits.
FLOAT, REAL
(8 bytes, IEEE format, from 10**-308 to 10**308)
CHAR
(fixed)
VARCHAR
(n): variable, with n up to 64K characters
LONG VARCHAR
: shorthand for VARCHAR at its maximum length (about 64K characters)
DATE YYYYMMDD
4 bytes
TIME
6 or 8 bytes long
TIME WITH TIME ZONE
TIMESTAMP
date and time together
TIMESTAMP WITH TIME ZONE
Also large objects, geospatial, XML
Dates stored as integers:
((YEAR - 1900) * 10000) + (MONTH * 100) + DAY
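Worked example of the encoding, plus a way to check it (the CAST to INTEGER exposes the internal value):
-- 2022-04-28 -> ((2022 - 1900) * 10000) + (4 * 100) + 28 = 1220000 + 400 + 28 = 1220428
SELECT CAST(DATE '2022-04-28' AS INTEGER); -- should return 1220428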
Now: CURRENT_DATE
(just date)
CURRENT_TIME
(just time)
CURRENT_TIMESTAMP
(date and time down to sub-sec)
CAST('28/04/2022' as TIMESTAMP FORMAT 'DD/MM/YYYY')
– Additional column attributes:
NOT NULL
DEFAULT val
WITH DEFAULT -- 0 for numeric, space for char
FORMAT 'YYYY/MM/DD' -- remember: single quotes
UPPERCASE
CASESPECIFIC -- force case to matter for comparisons
PRIMARY KEY
COMPRESS -- compresses nulls
COMPRESS 'abc' -- compresses nulls and 'abc'
COMPRESS ('abc', 'def') -- compresses nulls, 'abc' and 'def'
COMPRESS USING user-defined-function
DECOMPRESS USING user-defined-function
Multi-Value Compression (MVC)
Takes up to 255 distinct values out of the stored rows and stores them in the table header on each AMP.
Cannot do this on a primary index column
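A sketch combining some of these attributes (hypothetical names and values):
CREATE TABLE db.orders
( order_id INTEGER NOT NULL
, status CHAR(1) UPPERCASE COMPRESS ('A', 'C') -- MVC: nulls, 'A' and 'C' live once in the table header, not in the rows
, created_dt DATE FORMAT 'YYYY/MM/DD' WITH DEFAULT
)
UNIQUE PRIMARY INDEX ( order_id ); -- COMPRESS is not allowed on order_id, the primary index column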
Unique primary index and non unique primary index (duplicate values allowed)
The primary index determines which AMP executes the query (data distribution) via a hashing algorithm
UNIQUE PRIMARY INDEX (col[s]) -- UPI
PRIMARY INDEX (col[s]) -- NUPI
To change the primary index, you have to drop and re-create the table
Secondary indexes are non-primary indexes. They can be unique or not
They are not involved in data distribution
UNIQUE INDEX (col[s]) -- USI
INDEX (col[s]) -- NUSI
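A sketch on a hypothetical table: the UPI in the DDL, the secondary indexes added afterwards:
CREATE TABLE db.emp
( emp_id INTEGER NOT NULL
, badge_no INTEGER
, dept_id INTEGER
)
UNIQUE PRIMARY INDEX ( emp_id ); -- UPI: drives data distribution across the AMPs
CREATE UNIQUE INDEX ( badge_no ) ON db.emp; -- USI
CREATE INDEX ( dept_id ) ON db.emp; -- NUSI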
Macros: users need only the EXEC privilege, not privileges on the underlying objects.
A macro executes as a single transaction.
REPLACE MACRO abc(p1 INTEGER) -- for VARCHAR, you have to give a length
AS
(
SELECT ... WHERE col = :p1 ; -- ";" after each statement
);
EXEC abc(123);
SHOW MACRO abc;
DROP MACRO abc;
I can do DDL, but it has to be the last statement
It is not necessary to use CREATE the first time, therefore use REPLACE instead
CREATE PROCEDURE proc_name (p1 data_type, p2 data_type)
BEGIN
<SQL or SPL statements>
SELECT ... WHERE col = :p1 ;
END;
CALL proc_name;
CALL proc_name(1,2);
REPLACE PROCEDURE <scm>.<pnm>
(
IN var_in VARCHAR(30)
,OUT var_out VARCHAR(250)
)
SQL SECURITY INVOKER
MAIN: BEGIN
DECLARE a_var VARCHAR(250);
DECLARE i INTEGER;
DECLARE cr CURSOR FOR
SELECT ...
FROM ...
;
SET a_var = ' '; -- DECLARE statements must come before the executable statements
SET i = 1;
for_loop: FOR
tab_cr AS cr
CURSOR FOR
SELECT ...
FROM ... -- repeat what is above
DO -- no ";" between the cursor SELECT and DO
...
END FOR for_loop;
IF ( SQLSTATE <> '00000') THEN
SET var_out = 'A value';
LEAVE MAIN;
END IF;
WHILE cond
DO
....
END WHILE;
SELECT COUNT(*)
INTO i
FROM a_table;
-- Transaction in Teradata session. Otherwise, in ANSI, use COMMIT; and ROLLBACK;
BT;
update or insert
ET;
CLOSE cr;
SET var_out = 'A value';
END MAIN;
-- This may not be needed because you can execute "in-line"
CALL DBC.SYSEXECSQL('COLLECT STATS ON '||TRIM(a_db)||'.'||TRIM(a_table)||';');
CREATE MULTISET VOLATILE TABLE VT_a_table AS
(
SELECT * FROM ...
)
WITH DATA PRIMARY INDEX ( the_id )
ON COMMIT PRESERVE ROWS;
DROP TABLE VT_a_table; -- best practice to explicitly drop
select otranslate('abcd012zZaA09', 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'aaaaaaaaaaaaaaaaaaaaaaaaaa9999999999AAAAAAAAAAAAAAAAAAAAAAAAAA');
-- maps every lowercase letter to 'a', every digit to '9' and every uppercase letter to 'A': handy for profiling the pattern of a string
<aggregate function> OVER
([PARTITION BY columnname]
[ORDER BY columnname]
[ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING])
<aggregate function>
can be sum, count, max, min, avg
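A minimal running-total sketch (hypothetical names):
SELECT acct_id, tx_dt, amount
, SUM(amount) OVER (PARTITION BY acct_id ORDER BY tx_dt
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM db.transactions;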
NOT =
LIKE, NOT LIKE, LIKE ANY ('...', '...'), LIKE SOME ('...', '...'), LIKE ALL ('...', '...')
LIKE 'ABC\_DEF' ESCAPE '\'
"%" for 0 or many, "_" for one character
CASESPECIFIC or NOT CASESPECIFIC: WHERE col (CASESPECIFIC) LIKE 'S%'
SELECT TOP n ...
TOP n PERCENT:
takes n% rows
TOP n WITH TIES
SAMPLE n
: takes n rows
SAMPLE 0.4
: takes 40% of rows
To get a fully random sample:
SAMPLE RANDOMIZED ALLOCATION 0.1
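Quick sketches (hypothetical names; TOP ... WITH TIES needs an ORDER BY):
SELECT TOP 5 WITH TIES emp_id, salary FROM hr.emp ORDER BY salary DESC;
SELECT * FROM hr.emp SAMPLE 100; -- 100 rows
SELECT * FROM hr.emp SAMPLE RANDOMIZED ALLOCATION 0.1; -- about 10%, fully random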
CROSS JOIN
: no ON clause. A cross join is a Cartesian join
RANK() OVER(PARTITION BY ... ORDER BY ... DESC)
QUALIFY
: a filter on an ordered window (analytical) function.
Seems to be the equivalent of "WHERE" or "HAVING" for analytical functions
Example: see the top row for each partition:
SELECT ..., RANK() OVER(PARTITION BY ... ORDER BY ... DESC) AS col_alias
FROM ...
QUALIFY col_alias = 1;
ROW_NUMBER() does not have ties, RANK() has ties
TYPE(col)
returns data type of column
CONCAT(col, col)
or ||
SUBSTRING(S, start_num, length)
SUBSTR:
idem
SUBSTRING(str FROM n FOR m)
ANSI syntax, same result
TRIM, RTRIM, LTRIM
UPPER, LOWER
INDEX(str, substr)
: returns starting position of substr in str. 0 if not found
POSITION(substr IN str): same as INDEX
LENGTH
OTRANSLATE(str, 'abc', 'xyz')
a->x, b->y, c->z
OTRANSLATE(str, 'abc', '')
remove
OREPLACE(str, o, n)
replace every occurrence of o with n
CAST(x AS INTEGER)
verify with TYPE(CAST(... as type))
CAST(str AS DATE FORMAT 'YYYY-MM-DD') B=blank
also: TO_CHAR, TO_NUMBER, TO_DATE
CASE WHEN condition THEN val
ELSE val
END
CASE value WHEN value THEN val
ELSE val
END
COALESCE(val, val)
first non null value in list
NULLIF(expr1, expr2)
returns NULL if expr1 equals expr2, otherwise expr1
Dates:
CAST(str AS DATE FORMAT 'YYYY-MM-DD')
B=blank
TO_DATE(str, 'YYYY-MM-DD')
CURRENT_DATE + INTERVAL '3' YEAR
CURRENT_DATE + INTERVAL '3-01' YEAR TO MONTH or CURRENT_DATE + INTERVAL '03' YEAR + INTERVAL '1' MONTH
CURRENT_TIMESTAMP + INTERVAL '01 05:10' DAY TO MINUTE
This format can be confusing: it is better to do INTERVAL ... + INTERVAL
WHERE dt_fld BETWEEN (CURRENT_DATE - INTERVAL '4' DAY) AND CURRENT_DATE
last 4 days
EXTRACT(YEAR FROM CURRENT_DATE)
– month, day
EXTRACT(HOUR FROM CURRENT_TIMESTAMP)
– minute, second
interval: see doc
Time: 6 bytes, stored as three fields, with 4 bytes for seconds with up to 6 digits after decimal
CAST (col AS TIME WITH TIME ZONE)
SET TIME ZONE INTERVAL -'4' HOUR
Get just the time:
SUBSTRING(TO_CHAR(CAST(EVENTTIME AS TIME)),1,8)
Format: HH, MI, B (blank or space), T=AM/PM, SS, SSDS(one decimal point), Z=timezone
FORMAT 'HHhMImSSs'
–> 14h32m24s
Timestamp: 4 bytes for date, 6 for time
https://docs.teradata.com/r/Getting-Started-Guide-for-Data-Scientists-Using-Python-with-Vantage/December-2020
https://docs.teradata.com/r/Teradata-Package-for-Python-Function-Reference/November-2021
https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/November-2021
import teradataml as t
# or: from teradataml import *
eng = t.create_context(host='192.168.1.223', username='dbc', password='dbc', database="hr")
t.db_list_tables("schema name", "%ou%", "[all|table|view|volatile|temp]") # all parameters are optional
t.db_list_tables("schema_name", "%part_of_table%")
cn=t.get_connection()
for row in cn.execute("select * from dbc.databases where databasename like '%H%';"):
print(row)
def e(s):
for r in t.get_connection().execute(s):
print(r)
e("select trim(databasename), trim(tablename) from dbc.tables where databasename like '%H%' order by 1,2")
# this is very flexible, and is not as picky as creating a dataframe
df = t.DataFrame.from_table(t.in_schema("dbc", "databases"))
df = t.DataFrame.from_query("select sn, city, upper(state) as st from hr.ville_temperature;")
# This creates a temporary table (derived?) and requires the appropriate permissions
df.info()
df.describe()
df.dtypes
df.columns
df.index # None if there is no primary index on the table
df.shape[0] # number of rows
df.sample(3)
df.count(distinct=True)
df[df.TableKind=="T"].select(["TableKind","tbname","dbname"]).groupby(["TableKind","dbname"]).count().sort(["TableKind","dbname"])
df[(df.col==4) & ~((df.col5==5) | (df.col6==4))].select(["col1", "col2"]).groupby(["col1"]).max()
assign(): adds new columns. df.assign(new_col = ...., new_col2 = ...). Add "drop_columns=True" to remove original columns
concat(): concatenate two DataFrames, like a union operation
describe(): generates statistics (profiling)
drop(): drop columns: df.drop(["col1", "col2"], axis=1); drop rows: df.drop(df.col==xyz, axis=0), df.drop([index_val1, index_val2], axis=0)
dropna(): looks similar to dropna in spark
filter(): get a subset of rows, or columns
get(): get columns from DataFrame
get_values(): get values in the form of a numpy array
groupby(): .groupby("col") or .groupby(["col1", "col2"]). Note: returns all columns
head(): first n rows
join(): with how taking left, inner, right, full, or cross. df1.join(other = df2, on = [df1.col1==df2.col1b,"col2=col3"], how = "left", lsuffix = "t1", rsuffix = "t2")
map_row(): apply a function to the rows
map_partition(): apply a function to groups (partitions) of rows
merge(): with how taking inner, left, right, or full (not cross). df1.merge(right = df2, on = ["col1=col1b", "col2=col3"], how = "left", use_index = True, lsuffix = "t1", rsuffix = "t2"). Alternate for on: on = [df1.col1==df2.col1b, df1.col3!=df2.col4]
sample(): .sample(n=10), .sample(frac=0.2), .sample(randomize = True)
select(): select("col"), .select(["col1", "col2"]), .select([["col1", "col2"]])
set_index(): .set_index("col"), .set_index(["col1", "col2"]). Add a column to an index: .set_index("col3", append = True, drop = True)
show_query(): show the underlying query. Can be called without parameters; df.show_query(full_query = True) may be needed to see the complete query
sort(): .sort("col"), .sort(["col1", "col2"], ascending=True)
sort_index(): sort by the index
squeeze(): only for one-dimensional objects
tail(): last n rows
df.info()
df.keys(), same as .columns
df.columns: list of column names
df.dtypes: column names and types
df.index: usually the primary index of the underlying table or view
df.shape
df.size: number of elements=rows*columns. Get the number of rows with .shape[0]
df.tdtypes: the column names and the data types of the underlying table
df.to_sql('the_name', primary_index='col', set_table=True, if_exists="replace")
DataFrame.from_table(t.in_schema("dbc", "databases"))
Vantage is the new packaging
licensing : developer (no fees), base, advanced, enterprise
Interfaces: TD studio (also has an admin pane), TD studio express, TD SQL Asst (old), BTEQ (download under TD tools and utilities)
joins: left outer join, right outer join, full outer join
cross join: every row on left with every row on right
COLLECT [SUMMARY] STATISTICS [INDEX (indexname) | COLUMN (columnname)] ON <tablename>;
COLLECT STATS ON VT_the_name_of_the_volatile_table PRIMARY INDEX(... key columns ...);
Three ways to collect statistics:
HELP STATISTICS <table name>;
Show statistics
EXPLAIN select * from ... where ..
gives the explain plan
JOIN INDEX = materialized view, for improving performance
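A sketch of an aggregate join index (hypothetical names; check the exact syntax and restrictions in the docs):
CREATE JOIN INDEX db.sales_ji AS
SELECT s.store_id, p.category, SUM(s.amount) AS total_amount
FROM db.sales s INNER JOIN db.product p ON s.product_id = p.product_id
GROUP BY s.store_id, p.category;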
virtual box:
To make the database available to the host, go to the settings of the VM and select "Bridged Adapter", then check the IP address with ifconfig on the VM
Additional topics that need further reading:
echo "# iom" >> README.md
git init
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/yekesys/iom.git
git push -u origin main
create token:
Unstage:
git restore --staged .